
>> Jim Larus: Our first speaker is Gad Sheaffer from Intel.
>> Gad Sheaffer: Hi. Am I on the speaker? Okay. So I'd like to talk about an
activity at Intel we're calling design for user experience. I'm an architect in the
Israeli design team at Intel's Haifa center in Israel. I'm working on a future
[inaudible] that will come out sometime later -- I can't completely talk about it. [laughter].
But we'll talk about the five to 10 year horizon for applications for that processor.
So this exercise, this design for user experience, is about basically looking for
processor enhancements -- because that's what we were looking for -- that allow for
meaningfully enhancing the user experience.
We're trying to do that for a change: we're doing a top-down search. We really are
starting with usage models and going down to applications and the [inaudible].
What we really want to do is optimize our upcoming design for future applications
rather than past ones, right? What currently happens is that we typically design a
processor for five years out with applications that were
traced 12 years back and comprise the bulk of our applications. So we want to
sort of change that.
However, we do not plan to actually develop these applications. We are looking
to seek collaboration with people who have these applications at hand,
with whom we could collaborate and tune the hardware and
software in concert.
So we are looking under some street lights. We're just trying to have the street
lights be different from the ones we used to look under. We are
looking for generic or focused performance improvements and enhancements. I
mean, there are many other qualities of PC platforms that could be enhanced;
we're looking at performance.
We're focusing mainly on client platforms five to ten years out, and mostly on the
processor, but we are also amenable to chipset and other enhancements.
Street lights we are trying not to look under: in the past we've been targeting our
SSE ISA extensions and such; we've been looking mainly at more and more video
encoding and transcoding between different formats at ever-growing resolutions. It
gets a bit boring after a few generations.
And we've basically been looking at applications that appeal to
the average Intel engineer.
Features that address the needs of US households with large incomes, like the
terabytes of movies in the basement that you transcode and send to the 50-inch
screens in each room. That's not a very common occurrence in many
places outside the US.
And perhaps some of the applications that we're looking at are
things that the user is not even aware of as applications -- things that happen
in the background. They are not really necessarily shrink-wrapped.
And we are looking at many usage models that do not have the user actually sitting
behind the keyboard and the screen. So we tried this -- this is the first time I
think we have done this; I think Microsoft is doing similar things, but this is a
first for us. We tried to actually focus on the daily lives of real people
from diverse geographies, age groups and economic situations. We had the
Intel ethnography teams identify people based on market data and various
sociological trends. The ethnography people actually identified these people and
met and spent some time with them. So these are
not archetypes; these are real people that have names and faces, and we -- I haven't
personally met them, but I have seen their pictures and I know a lot about
their lives, and other people, the ethnography people, actually spent a few
days with them at their homes and workplaces.
So we think we know what they might want and what they would
value. And some of these people represent markets and usages not commonly
thought of as computer use. But they could represent growth
opportunities. I can give an example of one of them -- he's a [inaudible] from
Morocco who has his own business. He has no computers there, but he has a pretty
good cell phone which he's using to run his business. Just as an example.
So we started with ethnographic data and market research that sort of allowed us to
pick the personas. Based on these personas we tried to develop scenarios.
Basically we did a day in the life of these people five years out. And we tried to
translate those into system capabilities.
So this is how this exercise looks. We have a group of five to seven
people sitting around the table, per persona. One of those
people actually knows the person who is being discussed and has spent
some time with them. And they prepared in advance -- they took pictures
and prepared a set of small images of the hours of the day
and the various phases of these people's lives: entertainment, work, family
interaction and so forth.
And then during the day, we basically try to come up with ideas about where
technology could positively intersect with their life in a beneficial fashion. Okay?
So we write these ideas on post-it notes and stick
them up at the relevant time of the day and activity. Some of it is patently
bogus, but some of it is real, right? So I can give an example.
So anyway, let's talk a little bit about a day in the life of MK. He's a Korean airline
executive. He's married, has three children, a dog, no life. [laughter].
He gets up early and works while being driven to work -- he can't really
afford the time to drive himself, so he actually works while being driven. Swims
in the corporate swimming pool. Works. Is driven to business lunches, back to
work. Works while being driven home. Sleeps. That's what his week looks like.
So here is MK. You see only two kids here because the third one is in the US --
they think he will get a better education there. And this is how the
board looks, right? So where is MK swimming? Anyway, you can
see work sort of takes up most of his activities during the day.
And so we basically sketched things like the various activities of his
day: being driven to work, at work, back home, business meetings -- and by
business meetings I mean typically meeting people whom he has never met before.
And one of the usage models we tried to come up with is what we're calling a
charminator -- basically something that would be like an icebreaker. The
system would tell him some things about the person he's meeting,
so he'd be able to start the conversation.
For example, the only free time during the day when he was not actively working
was when he was swimming, so we tried to come up with things
that we could do for him while he was swimming, right?
So: a swim coach and monitor. A video
camera would track his movements, but we had a hard time figuring out how to
actually provide him with feedback while he was swimming. I mean, it's a bit
tricky.
>>: [inaudible].
>> Gad Sheaffer: But the water changes, so we need to do
some image modification so that it looks rectified even while --
>>: [inaudible].
>> Gad Sheaffer: Yeah. So -- yeah. So anyway. Oh, we could put some
earphones on him and read him some e-mail while he was swimming. [laughter].
Anyway, so we tried to augment this methodology with market trend
reports from various sources. The problem with that is that if you ask users
today what they care about, they mostly care about
[inaudible], right? It's really hard to get people to actually associate value with
things they have not really experienced, or are not even aware
could be done, all right. It's the same for analysts -- this is market data
from [inaudible] or something like that. It's really hard to get analysts to
think six or seven years out. Really.
Plus, what they tell you is not always what you wanted to hear, right? So
that's a problem, too.
And so anyway, the applications: how do we select applications based on these
usage models? Obviously, processor performance and features have to be key to
enabling them, to making these applications happen. Better that they not be
dependent on unknown algorithms -- that's why some of the natural
language processing usage models are a bit suspect in my view.
They need to happen in the time window that we're discussing, so that
means they need to exist in some form today, in some prototype form --
maybe the performance today is not satisfactory; maybe with 10x extra
performance they go from, say, one frame per second to 20 frames per second,
which is acceptable, things like that. But they need to be available -- you need to be
able to demo them today, even if not in real time.
And ideally these would be applications that require a big step function in
performance, or some feature, to make
them happen, right? Not just the five to 10 to 15 percent a year compound growth rate
that we have today -- we would need to do something proactive to make these
applications happen.
And this is a nice-to-have, but still: starting this year we have integrated
capabilities for graphics and video processing -- focused performance capabilities --
and it just so happens that many of these applications are about video and 3D,
so it would be nice if we could leverage these focused performance
features.
So the top usage models that we have identified, the new usage models, are mostly
about computer vision. And I think a lot of people are sort of in agreement
with that -- I've seen a lot of computer vision work being done in
this workshop. Telepresence is something that resonated very well with a lot of
users, both for business purposes and for personal reasons; it's a very
fragmented world we live in. So it needs to be affordable, realistic and immersive
teleconferencing, both from home and office.
Semantic search: natural language processing, basically semantic
understanding and inference, personalized and in context. And physics and
photo-realistic modeling. I'll talk a little bit more about each of those.
So, computer vision. The usages that we imagined were: searching using a
template or a face. Inferring where a picture was taken. Recognition,
identification, and tracking of people -- I wasn't sure about this one.
Understanding gestures, gaze, eye movement, head movement in conjunction
with other modalities. Interactive games. Authentication based on face
recognition. E-fitting. Picture enhancement, et cetera, et cetera.
And for telepresence we are looking at affordable, realistic, immersive
teleconferencing. We're mainly interested in people actually communicating,
not in enabling people to be in a different place and interacting with, say,
inanimate objects. It's mostly about people meeting people, and getting
an experience that is similar to a face-to-face meeting -- that has the same
trust-building capabilities as a face-to-face meeting.
Currently at Intel, at least -- since Intel is a very geographically diverse
company -- we have telepresence rooms that are extremely expensive but are
pretty good. And we're using them, and they are very beneficial. But
we're looking to provide these half-a-million-dollars-and-above-per-room features
basically on a notebook or desktop computer.
Close to realistic: you should feel as if the other person is in the same
environment. Immersive, which means that both
video and audio are spatial -- which means if I'm in the meeting and I'm looking
over there at that person over there, then everybody else is aware that I'm looking
over there, right, and he knows that we are sort of establishing eye contact.
Right? That's one of the key components of making this feel as close as
possible to a face-to-face meeting: establishing the correct eye contact.
Low latency is an obvious thing.
Having this surround feeling. We've been thinking about having this virtual
meeting space that could be configured -- you could set it up as a
lecture hall or as a round table or as a living room, right? And in those you'd be
able to decide where you sit and next to whom you sit. You might be able to whisper
in somebody's ear and establish a side channel while everybody else is
aware of what you're doing, right? True eye-to-eye contact is an important
capability.
Today when you're doing telepresence, the camera is up here and I'm looking
at the screen, so the other person gets the feeling that I'm staring at his navel.
That's a bit disconcerting.
Another problem today with telepresence: the systems are not
interoperable, right? If you have a Cisco system you can talk with other Cisco
systems. If you have a [inaudible] system, that's all you can do. So, open
standards -- if we can establish some standards here, that would be good.
Bandwidth and quality of service. Many of the systems that we have today
require leased lines -- low-latency, high-bandwidth, dedicated lines. And we
are trying to trade bandwidth for compute if possible. Maybe
instead of passing multiple high-bandwidth video streams, we'll be able to
pass a texture that gets overlaid on avatars on the other end, and that's a
way to compress significantly.
And multiparty with multiple participants.
Semantic search -- that's something I think a lot of people here can
sympathize with. All of us are inundated with more information than we can
digest. So ideally I would like to ask a question and get just a single
answer that is based on my past history, my interests -- what's relevant for
me in this context and at this time.
Multi-language automatic translation. Remember, we discussed people from
various geographies. Not all of them interact with people from
other countries, but many of them actually do. And there is this aspect of augmented
reality, basically: if you're in a meeting or in a teleconference, you get
complementary material that is needed for the task at hand.
Personalized: all of the above is a function of who I am and what I'm doing, why
I'm asking and where I am, et cetera, et cetera.
Physics modeling is also promising -- it's really not a
usage model itself, it's a component of a number of usage models. Photo-realistic
modeling of the human body, face, hair, and other material objects. Being able to
model viscosity, inertia, elasticity, and to interact with physical objects as if they are
real objects. Virtual shopping and trying on clothing in virtual space -- I think this is
something we're starting to see today, but it's not very
realistic. Virtual worlds. And possibly haptic output.
So I'd like to give an example of how we operate. We've described usage
models, which we're trying to break down into detailed scenarios, system capabilities,
algorithms, workloads, and kernels and primitives.
So for example, multimedia search has these components -- face recognition,
object instance and class detection, and ranking -- which translate in turn to
classification, feature extraction, inference, regression and so forth. And what
we basically want to have in the end are actual kernels, preferably
highly optimized, that we can actually optimize the hardware against.
So in summary: a processor design is optimized to a specific metric -- you excel
at what you measure, right? And we want this measure to be applications that
are forward-looking, not things that were traced 20 years before the processor
actually gets to the market.
It would be nice if these were applications that people actually cared about. We
are looking for components of these applications in relatively
performance-optimized versions -- and fortunately, from what I've seen here, those
versions are typically [inaudible] versions. And we'd like to be able to quantify not only
these components but also the value of improvements to each of them in the
overall grand scheme of things -- how they would contribute to the overall usage
model.
And for that we are actively seeking collaboration with software vendors,
trying to do this hardware-software co-design to jumpstart this brave new world of
applications that all of us are dreaming of. Okay. Thank you.
[applause].
>> Tony Keaveny: Okay. So I'm delighted to have a chance to talk to you about
what we're doing at Berkeley on the health applications project as part of the big
ParLab project. And I chose the title carefully here: personalized medicine has a
very fixed meaning out there. Typically it's associated with genetic analysis -- you
do some kind of genomic analysis on a sample of your tissue, and that's
hopefully going to identify your future risk of developing some disease or, ideally,
your future chance of responding to a certain type of drug treatment.
So everybody's very excited about this. A lot of money's been pumped into it.
But there's certainly a lot of hype. And the way I see the field of personalized
medicine going is that there will be that genetic component, but imaging is going to be
a huge component, because imaging is where you're at now. Think of the
genetic analysis as some kind of vector of where you may go, right? But
most diseases we're worried about are so complex that the genetic
factors alone really have a small influence on your overall risk of developing the
disease. An image of where you are now is like a summation of all the things
that have happened in the past; you get a snapshot of the current state.
So I think personalized medicine and medical imaging is going to be big in the
future, and the key to it is going to be assessing the information that's in the
medical image, and that's where all the computation comes in. So that's what I
want to talk about.
My background is I'm a mechanical engineer. I do all my work in biomechanics.
And I use a lot of computation, but I don't do any coding, and you'll see that as I
progress. Jim Demmel is here to handle those questions.
So I see the world through my biomechanics lens. And the vision then is to
take medical images -- could be any type; CT scans and MRI scans would be the more
interesting ones because they are 3D -- and couple that with computation at many
levels. We have image processing, we have number crunching, but we bring into it,
from my perspective, biomechanics. And the biomechanics is really what
connects me with the clinical function.
For example, if I'm interested in osteoporosis, I need to know something about
how bones break, and I want to extract that information from a medical image.
So in my world, I bring in clinical function through the biomechanics; you try
to couple all of that together, and the outcome can be improved diagnosis:
what's your risk of having, say, a hip fracture; what's your risk of having a
cardiovascular event. Surgical planning: I'm going to do spine surgery on you --
what implant should I use for you, and what is the best place to put that
implant in your body? And generally, patient management: okay, you're on a
drug treatment -- how are you responding to it? We want to
get feedback as soon as possible.
So all of these things will come out of imaging combined with our computational
biomechanics. Let me give you just a little bit of background on the
biomechanics perspective. These are some common diseases that have very
large biomechanical components in their etiology. Cardiovascular disease,
obviously related to blood flow. Arthritis: your joints wear away, and there is a clear
effect of body weight and level of activity on developing arthritis.
Osteoporosis is all about bones breaking. Chronic back pain: we don't
even know really what the source of chronic back pain is, but it's clearly
associated with mechanical loads placed on the back. And repetitive injuries -- I
haven't put a cost on that because it's so big. Repetitive injuries are the number
one source of worker compensation claims in the United States. It used to be
low back pain; it's now repetitive injury.
Typically it's from typing. Now, think about this: you've got these extremely
subtle repetitive movements of your fingers causing the cells to freak out and
start secreting all these enzymes, which start destroying the nerves and all the
soft tissue, and you've got carpal tunnel. A very, very subtle thing going on here,
and it's all biomechanics coupled with the biology.
So these are all areas where if we can combine medical imaging and
biomechanics, I think we can make big advances. And it will really change the
future of medicine.
So here is an example and a success story from what we've been doing,
previous to this ParLab project, in the bone area. The one you just saw there is a
little model of a small piece of trabecular bone, from a very high resolution image. That
was about a five-millimeter piece of bone, and it was being compressed. It has
about 5 million degrees of freedom. We make those models from micro-CT
scans, and we can also apply the same micro-CT technology to cadaver bones
to analyze a whole vertebra. So this is a cross-section of a human vertebra,
and you're looking, in red, at the regions of high stress as you compress the
vertebra.
So you can see there are some very discrete patterns of load transfer from
the top to the bottom of the vertebra. And that model has 500 million degrees of
freedom, right? So very, very large models. And we run these with non-linear
material properties, and we use big supercomputers to do that.
For the problem on the right-hand side we won the Gordon Bell Prize for the
scalability of the software that was used for it. We used software called
Athena, and I'll tell you a little bit more about that later.
So that's the state of the art in numerical computation in the world of structural
mechanics. And we wanted to take that and bring it to the clinic, right? We
want to use this on granny, who might have osteoporosis -- and there are many of
you who will have problems with your bones in the future. It is a certainty, right?
Men and women. So pay attention.
So in 2003, we published a paper where we basically used a much
coarser version of what you just saw. These are continuum elements, and in
the model here each element is about one millimeter in size. What you just saw --
that detailed little cube model -- was 4 millimeters cubed and had a million elements in
it, okay. So we homogenized a lot of the problem and came up with
these coarser models. In 2003 this made the cover of one of the research
journals, and they predicted that we would be able to predict vertebral strength in
patients. And there was a lot of excitement. And based on that excitement we
started a company. The company is called ON Diagnostics. And the purpose of
this company is to take this kind of virtual stress test from the lab to clinical
practice in orthopedics, and bone strength is the first application.
So here we are in 2008. We also made the cover of a very clinical journal,
Arthritis & Rheumatism, so all the medical people are reading this -- engineers
never read that journal. And it really made me chuckle that we've got a finite
element model on the cover of a clinical journal. That's just great. I was just so
happy with that. But the significance here is that it's the same type of model,
right, but we'd gone from the lab into actual patients, right?
And here's a study -- we do this on the spine, we do it on the hip. Here are results
from the study in the corner there. It's a fracture surveillance study of 6,000
men over age 65. It's called the MrOS study, the osteoporosis in men study. About
seven years ago all those men had CT scans taken of their bones, and we
just follow them over time to see who is going to fracture. At the time of this
analysis, in fact, we had CT scans of only half the cohort. So 3,000 to 3,500
men had CT scans and were tracked for a few years.
So at the time we got this data, 40 men had fractured --
40 men with CT scans had fractured. We have their baseline CT scans, so
we analyzed those 40 scans of the guys who fractured and 210 others randomly
taken from the sample. We're blinded, so we don't know who fractured and who
didn't, and we made our prediction of the strength of their femurs using models
just like this. And then this is plotted against the bone mineral density of those
men. That's a clinical test -- it's called a DEXA scan, the standard clinical test to
assess bone density.
So what you can see in that graph is that the light blue points are the men who
fractured. And if you go by the current clinical standard, they express the bone
mineral density as what they call a T-score. It's the number of standard
deviations your bone density is below a 30-year-old's, right? So it's just a linear
transformation of the data here.
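For reference, the standard definition of the T-score he's describing is:

```latex
T = \frac{\mathrm{BMD}_{\text{patient}} - \overline{\mathrm{BMD}}_{\text{young adult}}}{\mathrm{SD}_{\text{young adult}}}
```

so the osteoporosis threshold he mentions below, a T-score of less than minus 2.5, means a bone density more than 2.5 standard deviations below the young-adult mean.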
The people on the left of this would be considered to have
osteoporosis, and they would be put on a drug treatment in clinical practice. And
all the others would be considered either low bone mass or normal, and we're not
going to do anything with them, right?
So you do capture a bunch of these folks who fractured, but a small number. On
the strength side, we found that everybody who had a bone strength below
3,000 Newtons broke their hip. And many of those who broke their hip were
what we considered low bone mass. Now, in reality, of the one and a half million
people who have osteoporotic fractures in the US every year, the majority
do not have osteoporosis, right? If you have osteoporosis, you are at
high risk of fracture, but most people who fracture don't have osteoporosis,
because there are just so many people in the non-osteoporosis category.
And what you see here is that, of the people who we predicted
would fracture, over half did not have osteoporosis. So this is very
promising. There's a lot of excitement over that. We're going for FDA approval
later this year. And if everything goes well, this will be available next year if you
ever want it.
And then --
>>: [inaudible] put these people on the same treatment that you give to the
normal osteoporotic individuals so that they will not fracture?
>> Tony Keaveny: Okay. So you --
>>: [inaudible]. [laughter].
>> Tony Keaveny: So the idea is to identify people so you can put them on a
drug treatment that will stop the loss of bone, or even build new bone, and therefore
strengthen their bones and reduce the risk of fracture.
The current drug treatments we have reduce the risk of fracture by 50 percent.
The problem with osteoporosis, as I'll show you in the next slide, is that it's an
undertreated and underdiagnosed disease -- and
there are many diseases like that. This is a little table. This is an estimate of the
percentage of fractures that are avoided each year by current
medical treatment. So if you think you're at risk for osteoporosis,
you'll get one of these bone scans. If the score is less than minus 2 and a half,
they'll put you on an osteoporosis treatment. Half of the people on
osteoporosis treatments who would have fractured will not
fracture. All right?
So if you just do the simple calculations, it turns out about 15 percent of the
population at risk is currently tested for osteoporosis. The test
that's used now, the BMD test, is about 50 percent sensitive. And the treatment
works in about half the people. So if you work it out, you avoid about 3.7
percent of the fractures. So if there are 300,000 hip fractures every year in this
country, all the treatment going into this means 3.7 percent of those 300,000 hip
fractures are not happening. All right? So it's a very small number.
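That 3.7 percent is just the product of the three rates he quotes:

```latex
\underbrace{0.15}_{\text{tested}} \times \underbrace{0.50}_{\text{test sensitivity}} \times \underbrace{0.50}_{\text{treatment efficacy}} = 0.0375 \approx 3.7\%
```

and the five-to-six percent figure that comes up next is the same product with either the sensitivity or the efficacy raised to 75 percent: 0.15 x 0.75 x 0.50 = 0.056, about 5.6 percent.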
Okay. So imagine if you had a new drug treatment that instead of having 50
percent fracture efficacy had 75 percent fracture efficacy. That drug would be
worth, I don't know, 30, 40 billion dollars, right? Because it would just
dominate the market. And it would be revolutionary, right?
It would change this percentage up to five or six percent. The same
change can be obtained if you simply have a better diagnostic test: if your
sensitivity went from 50 percent to 75 percent, you'd have the same thing. And
we're already there, right? Our test is already here. So we're equivalent to a
multi-billion-dollar drug. But the way money flows in the medical system, it just
doesn't work like that.
Now, if you can test more people -- for a disease that's
undertested, that has a dramatic effect that really dominates everything. And if
you can test more people and combine that with a better drug and a better
diagnostic, that's about as good as you're going to do.
So the point I wanted to make here is that doing better testing and diagnostics
is, at the end of the day, from a clinical, functional perspective, better than
making new medical devices and drugs -- it can have such a huge
effect. It just doesn't play out so well in the market, because we don't have a big
drug company behind this promoting it. But of course these are all things we
can change in the future.
>>: Other than money is there a downside to someone taking this drug treatment
if they don't?
>> Tony Keaveny: Yeah. So that's a good question. The drug treatments right
now are fairly benign, but there are some long-term concerns now. People
who have been on these drug treatments for 10 years are beginning to see
some problems with them. So the osteoporosis drug
development field is still very active. People are going for drugs that
have really no negative side effects.
And most of the drugs we have now stop bone loss. They don't actually make
your bones any stronger, they just stop the bone loss. And there is a new class
of drugs out that actually build new bone, but they come with some potential
risks. So it's a very active field, trying to get drugs that are anabolic -- that make
new bone -- and that don't have the downsides.
The anabolic drug we have right now is made by Lilly; it's called
Forteo. It's one of the most expensive compounds in the world -- they charge
$1,200 a month for that drug.
So that's a success story. We want to do the same thing with some other
problems, and the problem we've addressed is stroke. So let me explain the
stroke problem. One of the problems with the osteoporosis work, in terms of
implementing it, is that you say you've got a better diagnostic test and you go to
get it reimbursed by insurance companies. They don't care, right? Insurance
companies have the following experience: a third of their insurees switch
insurers every three years. All right? So I go to them and I say, hey, I've got a test
and I'm going to identify people who are going to break their hip five or 10
years from now. Will you pay for this test? They just look at you. No. Those people will
probably be with somebody else in five or 10 years, and they're not going to pay
for it.
Now, they'd never tell you that straight to your face, but this is what goes on. So
for this ParLab project, we wanted an application that would be
compelling in terms of the need to implement it and pay for it. And you
could charge -- well, not a lot of money, but you could charge a premium
for this test, because it would save so much money. All right? So we chose this
very specifically with clinical implementation in mind. And the problem is
about strokes. If you have a stroke and you go to the hospital and you get
diagnosed with your stroke, if you had that stroke more than three hours
previously, you will not be treated. The reason you won't be -- and we saw a little
bit of this the last day -- the reason you won't be treated is that the risk of
complications from the treatment is too high. There are two main
treatments: there's a little kind of clot-clearing plumber's approach, and there's a
blood thinner. And with either of those, the risk of you bleeding to death after the
treatment is considered too high if you've had the stroke more than
three hours previously. So you go to the hospital -- a lot of strokes happen while
you're asleep; you wake up in the morning and you're numb on one side of your
body -- and you won't be treated.
So your brain continues to -- the part of the brain that is not getting blood
continues to die, and you end up with severe, severe long-term disabilities. Okay.
So these strokes happen primarily in this vasculature in the brain that's called the
Circle of Willis. It's this little band of blood vessels right in the middle of your
brain. Everything that goes off to the microvasculature around the brain comes off
it, and the vessels at the front and back of it head down to the rest of your body.
And this Circle of Willis is where most of the blockages -- a
stroke is a blockage of a blood vessel -- occur. The downstream problems
actually occur out in the microvasculature.
So what we're interested in is studying the blood flow in patients who have had a
stroke, and we want to identify those patients who would have the lowest risk of a
complication from one of these treatments and therefore would be safe to treat
after the three-hour window, right? Okay. So that's the plan.
And of course the effect of that could be huge. If you could just identify the patients
who are safe to treat and treat them, that's massive savings for everybody involved
and a dramatic impact on the quality of life for that person.
So this Circle of Willis is interesting. And this is a nice anatomy drawing -- this
company Adam does all these anatomy drawings and of course makes you
believe that this is the way it is in your brain. The problem is, in your body this
vasculature is a little different for everybody. Sometimes this is missing,
sometimes one of these things is gone. Sometimes these are missing. Only half
of the population has what's in that picture; the other half has a variation on it,
okay? So if we simulate -- if you have a stroke -- let's say you get a
stroke here. The question is: gee, how is that going to change the blood flow
everywhere? And if we give you a blood thinner to reduce the viscosity of the
blood, are you going to raise the stress so high that all
the blood from here gets sent over here, and this is where you're going to
have that complication? Right?
That's what we're looking at. Or: gee, the stress is going to be so high here that
this blood vessel is going to pop, right, if you reduce the viscosity. So this is what
we're trying to look at. And of course you can have these strokes anyplace in that
circle.
So our solution is to take a patient's medical image -- a CT scan is what we're
looking at primarily; you could use MR scans -- and do a patient-specific blood
flow analysis. The stroke is there. What we do then is simulate what the
blood flow would look like if you took a blood thinner. Then we assess the
stresses in the Circle of Willis and downstream, and we ultimately want to
risk-stratify the patient.
So here is how we're going to try to implement that. We're going to
risk-stratify patients. What we're going to do, like the osteoporosis story, is
proof of concept on very high resolution models. We want to
make sure that the physics and the science we have here actually work.
And we want to validate that by predicting the people who are at highest
risk for complications. If you don't treat them at all, there are going to be patients
whose blood flow characteristics are so messed up by the stroke that they're
going to have problems without doing anything. So we want to identify those.
That will be a first level of validation. It's observational, so nobody's being
treated with anything.
And then in parallel with this, we're doing all of the numerics to be able to
develop these coarsened models, just as you saw with the osteoporosis work. We won't
be able to go into all the detail we have, and our goal is to do this in 10 minutes, so that
while the treating radiologists and neurologists are assessing the patient, they're
getting this information after taking the CT scan to risk-stratify the patient. And
ultimately we want to identify those at the least risk. That will be validated in an
interventional clinical study, and obviously, since this is being used to diagnose
patients, you have to get FDA approval.
These two things I don't think we're going to get to in the time span of this
project, but we want to set ourselves up so they're all set and ready to go. But if
we make fast progress, we will be able to get into those studies.
And here's the state of the art in what people are doing in this related area. This
is 2010 -- a set of 1D models. And it was just generic: this
guy made up a perfect Circle of Willis just from those anatomy drawings and
treated all the blood vessels as a 1D problem. A fairly simple model, but it's got
the solid-fluid interaction. This is a tricky problem because we have to do the
fluid mechanics and the solid mechanics, and that's a very, very tough
numerical problem.
And then he played all sorts of games, putting a blockage here and then
seeing how it affects the blood flow. So that's a nice study that allows you to
parameterize and ask what-if questions. But it's not patient-specific. And
this project is all about making this work patient-specific, fast.
So here's the other extreme. Here is a model that doesn't have any of the
Circle of Willis in it -- it just has a little segment of a blood vessel, but that
segment of a blood vessel is modeled in a lot of detail. So it's a
non-Newtonian fluid flow, which is pure pain, and it's got a two-layer -- it's hard to
see the second layer -- a two-layer anisotropic non-linearly elastic representation of
the blood vessel. And that's pretty complicated behavior.
So these guys -- this is 2009, so this is fresh off the press -- used a high-end
workstation and did just three pulses of flow, and it took them 10 hours on the
workstation, right? So this is just to give you a sense of what we're up against
here. What we want to do is take this and put it into the previous kind of 1D
representation -- the full anatomy -- do the whole shebang, and then scale that
down to something we can do on multicore.
So the strategy then is: a detailed solution on the supercomputer. Run
parameter studies; make sure we understand the physics, and what we can throw
away and what we can keep. Apply it to clinical cases at high resolution, just to
show that in the best-case scenario, yes, all of this makes sense and it
works. Meanwhile we're porting all of this to multicore. We're going to simplify
the models. How we're going to do that depends on the biomechanics, depends
on the image processing, depends on any numerical constraints we have. And
then ultimately go with that to some definitive clinical validation.
So our team: there's myself -- I'm the kind of applications person. Jim and Kathy are
our experts on the numerical computation side. And Panos Papadopoulos --
he was here yesterday -- is a [inaudible] guy, so he's big into stress analysis
from the mechanical engineering perspective. Very importantly, we've developed
a collaboration with some folks at Lawrence Berkeley Labs, and Phil Colella is a
big-time computational fluids guy. These other folks are full-time staff on his
team who are developing this big parallel code for fluid flow analysis; it's called
Chombo. I'll show you more about that.
And then our other faculty: the most important person there is Max Wintermark. He is
a neuroradiologist from UCSF, so he deals with stroke people all the time. And
David Saloner is a researcher at UCSF in medical imaging. Stan and
Mohammad are biomechanics guys at Berkeley. And Mark Adams wrote
Athena, which is our solid mechanics parallel code -- I'll tell you a bit more
about that.
And last but not least, of course, the people who make it all happen at the end of
the day are the students. So we have a number of students here, and that is a
mix of mechanical engineering and computer science students.
So here is a glimpse at Athena. This is a preexisting code we have for doing the
solid part of this. This is a [inaudible] element code -- this is the one that won the Gordon
Bell Prize for scale-up. So we're starting with that, and we need to adapt it to
multicore. The whole front end of it has two layers of parallelization, and
that obviously is going to change. And then all the message passing is going to
change when we get to multicore.
The other thing, of course, about the approach that we're taking with the
ParLab is these motifs. So we're not simply going to take this and
hack a solution to the stroke project; we want to develop a more general approach
using the motifs, so that what we gain from this applies in other areas, too.
And here is Chombo, the fluids code that's been developed at Lawrence
Berkeley Labs. It's also set up for parallelization. And the nice feature of this
is that it has automatic adaptive mesh refinement in regions of interest: high
stress and high stress gradients.
So both of these codes have to be adapted in a number of ways for the problem;
I'll mention a couple of things we're currently working on right now. And in
terms of the motifs -- these motifs are continually evolving, but at
least three of the motifs are clearly applicable to what we're doing, since we have
this unstructured problem and sparse matrices.
So we've been doing the project for a year at this point -- a little bit over a year,
or a little bit under a year, right? I don't know when it officially started. So
anyway, what have we done so far? I think we've figured out the strategy.
There were a number of different ways to couple the solid and the fluids, and we've
decided to take Athena and take Chombo and actually work with those, as
opposed to starting from the ground up, or trying to put everything
together in one code. So it's a solids problem and a fluids problem that talk to each
other. That's the approach we've taken. We've mentioned the team. In terms of
the code, we've got both of these codes now working on the same machine, and
the students have been trained in how to do that. That was a bit of an effort, but
that's safely done at this point. We're really hitting the biomechanics in detail. And
we've done some simulations -- I'll show you this simulation here. We've
also done some of these 1D simulations, just like that literature study that I
showed you from 2007. And we have identified a patient -- we have a CT scan,
so we're getting ready to make a patient model from it.
Here is a little simulation we've done -- a 2D simulation, just running Chombo
through its paces. This is a blood vessel here; this is the uneven surface of the
blood vessel with a blockage in the middle. It's to simulate a stroke that
doesn't quite fill the inside of the blood vessel.
And then you're looking at the blood flow. This is, let's see -- this is the velocity,
that's the pressure, and that's the vorticity over time. And this should be all nice
and continuous -- that's my Macintosh; I don't know what's up with it that it
won't play this a little more smoothly.
But this is just to show you that we're beginning to set up these problems and
see where the particular numerical challenges are that we have to
deal with. So, complicated patterns.
But what we're mostly interested in are downstream changes in the larger
vasculature and the very local changes around the stroke. And part of the
reason for that is you can't get the kind of fidelity from a medical image that
justifies extremely detailed numerical fidelity in those regions. So you're limited
somewhat by your input data.
Okay. So where are we going? In the next year we have some specific
objectives that are fairly well defined. We want to reproduce what the folks in the
literature have done and expand on it. So take this 1D approach -- we can
have the vasculature from the patient, but take a 1D approach to
modeling it -- and do a number of parameter studies with that.
And we've got the fluid-solid interaction in those models.
We want to run a patient-specific analysis, a full 3D analysis, on the CT scan, but
we'll just do that with Chombo, with the fluids. That will give us the information we
need to start thinking about mesh sizes and time steps.
The codes themselves have to be altered. With Athena we have to add in
non-linear elasticity. We don't have that -- we have plasticity, but we don't have
non-linear elasticity.
And we want to bring in tetrahedral elements. And in Chombo we want to put in
this moving embedded boundary. Chombo also has to deal with the curved
sides of the blood vessels, and we're currently working on that as well.
And then we need to couple these two together.
And then next year we're going to continue refining the code and bring in image
processing. The main thing is we're actually going to do some clinical cases --
that's the idea for next year -- at high resolution.
And then as we go towards the end of the project, optimize all of this once we're
on multicore, and then actually do clinical cases with multicore. And these
are the outcomes: we want 10 minutes -- we want to show we can do it in 10
minutes -- and we want this to work in a more general framework. I think that's all I
had to say.
>> Jim Larus: Questions?
>>: Yeah. There's a guy up the hill from you at LBL named Lenny Aliker
[phonetic] who you may know. He's done some interesting work with adaptive
irregular meshes. The motivation was essentially the kind of [inaudible] numbers
you have here, where adhesion to the wall is very important. In fact, your
problem is much worse because the geometry is time-variant, eventually. That
work was done at [inaudible], but he will remember it
really well. It's interesting because you can essentially build the mesh to conform
to the wall and adjust its resolution based on the magnitude of the vorticity, for
example, or something like that. In an adaptive way.
>> Tony Keaveny: Right. Right.
>>: It's wonderfully challenging to program, and in fact it may be quite useful to
use transactional memory when you're updating the mesh -- or some sort of
transactional scheme. But it's a very cool way to address this kind of
multiphysics model.
>> Tony Keaveny: Right. Right.
>>: Lenny is certainly in all these conversations as well. He doesn't happen to
be on Phil's team but he's certainly part of that conversation.
>>: Yeah, yeah. I know. Yeah. He's just in the neighborhood of Phil, not --
>>: Yes.
>> Jim Larus: Any other questions?
>>: How long in time do you need to be simulating? And like, what's your time
[inaudible].
>> Tony Keaveny: Yeah. So I would imagine the simulation just goes for a
few seconds, and the time steps are milli- or microseconds -- hundred-thousandths
of a second, I would imagine. I'm thinking.
>>: So have you spent any [inaudible] happen within a couple of
seconds of the [inaudible].
>>: Right. Right. Right. Because we're -- I think you'll be looking primarily at a
steady-state solution, right? You're just saying: look, the vessel's been there --
the clot's been there for a long time. So at this point in time, what's happening?
>>: There are actually two very different time scales. The solid [inaudible] looks a lot
more friendly than the fluid, so there might be as many as a hundred fluid time
steps in between solid steps.
>>: Right.
>> Jim Larus: Let's thank our speaker.
[applause].
>> Brad Werth: Hi, everyone. My name is Brad Werth. I work at Intel. I work
with video game developers to optimize their games for the Intel platform.
Obviously Intel does not create games. Our relationships with video game
developers run a whole continuum, from sort of a consultation level down to
we're-going-to-look-at-the-code, bring it in house, do some work, et cetera.
So the presentation you're about to see is sort of a state of the industry in
the way that Intel thinks about it: sort of where parallelism is for video games,
what are the current key challenges that developers are trying to navigate,
and how Intel is advising them to navigate those waters, wherever they are on
that continuum of working with us -- whether we're just giving them ideas or
actually doing the work with them.
So as you may have heard or may have intuited, video games are not
embarrassingly parallel. They are complicated to parallelize. An
interesting sort of fact is that there are a lot of individual systems and elements and
efforts inside a video game where, independently, it's well understood how to
parallelize them. We'll look at some of them here. But the interactions between
them are an integration problem, like many of the other speakers have been talking
about.
Maybe the difference in the video game case, though, is that this is not an
inter-process coordination problem; this is all within one process. And there's an
expectation that a game is going to be running essentially unmolested on the
hardware -- you're not going to be competing with, you know, somebody encoding
a movie in the background or something like that.
So as long as internally all the coordination can be made efficient, there's an
expectation that that's good enough.
>>: [inaudible] or are you talking about the server as well?
>> Brad Werth: I'm talking only about clients. And of course, partitioning
strategies for hardware resources can fail for a number of different reasons. One
of them is the third bullet here: moment to moment, the workload of a game
changes dramatically. You can't ascertain up front that, okay, this is exactly the
right partitioning of resources I want for the duration of the experience. You need to
be very flexible, and the approach that we advocate is a flexible one. And similarly,
most commercial games are targeting multiple platforms. Even if they're just
targeting the PC, of course the PC is this, you know, polymorphic platform that
has many, many variations.
So it's almost impossible to specifically tune a game for the realm of hardware it's
going to be played on. Even if it's a game on the Xbox 360, the PlayStation 3
and the Wii, those are all very different.
So one caveat to all of this: many of you have been giving presentations where
you've taken your research approach and you've applied it successfully to a
problem domain; you've got a sort of here's-where-we-are, here's-the-next-steps.
I don't have that level of detail for you, because we haven't had the happy
confluence of a game developer that's doing the sort of broad effort of parallelism
that we want them to do combined with a developer who's willing to share all of
their code and knowledge with us. Finding those two together is a bit
challenging. We have one or the other. So bear with me.
So the summary of what I'm going to propose is that task-based parallelism can
be applied to most of the threading paradigms that game developers typically
use, and if it's done successfully and well, it can remove all of these interoperation
problems. So we're going to look at that.
This is a very high-level diagram of some things that might be going on in a typical
game architecture. Put yourself in the mind of a game developer trying to
figure out how to parallelize this for all of the platforms you're going to target.
So you've got a particle system. We know that particle systems are essentially
the most embarrassingly parallel element within a game. You've got a big array; you
can break it up into pieces. How many pieces? How many threads should you
dedicate to processing those pieces? That partitioning issue is non-trivial if
you're targeting multiple pieces of hardware. Likewise, maybe you've got some
asynchronous jobs that you're going to call. You know, of course, that
you could define some job threads and throw work onto those threads, but again,
how many threads should you dedicate? Is it possible that the particles could be
processed on the same threads, et cetera?
You're using some physics middleware. You know that the physics middleware
is threaded, but you don't know how. You don't know how many threads it's
going to create. You should probably anticipate that it will oversubscribe your
system. That's to be avoided. You've got these operations in games that have
traditionally been handled by dedicated threads, like sound mixing, level loading,
some texture manipulation in the background. What does it mean if you know
you have a certain set of that type of work in conjunction with all of this
frame-to-frame work? How can they all work together?
And then you've got some kernels deep in your game that you know can be
described in sort of a directed acyclic graph structure, which is a pretty well
understood method of parallelization -- but again, how do you get it working with all
these other approaches?
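For that DAG case, the standard trick in a task scheduler like the TBB one discussed later in this talk is to reuse each task's reference count as a countdown of unfinished predecessors. A rough, hypothetical sketch (setup and error handling omitted):

```cpp
#include "tbb/task.h"

// One node in a DAG of game kernels. The node's TBB reference count
// is preset to its number of unfinished predecessors.
class DagNode : public tbb::task {
    void (*kernel)();        // the work this node performs
    DagNode* successors[8];  // fixed-size for sketch purposes
    int num_successors;
public:
    DagNode(void (*k)()) : kernel(k), num_successors(0) {}
    void add_successor(DagNode* s) { successors[num_successors++] = s; }

    tbb::task* execute() {
        kernel();  // run this node's kernel
        // Notify successors; whichever predecessor finishes last
        // (count reaches zero) spawns the successor.
        for (int i = 0; i < num_successors; ++i)
            if (successors[i]->decrement_ref_count() == 0)
                spawn(*successors[i]);
        return NULL;
    }
};
```

To start, each node is allocated with tbb::task::allocate_root(), given set_ref_count(number of predecessors), and the nodes with no predecessors are spawned; the graph then unrolls itself.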
So our premise to the developers is that if you can define all this work as tasks,
mapping the natural sort of paradigms that are used in games to a task
implementation, then you get a lot of nice benefits from that. You can use a single
thread pool, which will avoid oversubscription. And almost by design you'll have a
game that scales to different topologies.
So we're going to look at what those paradigms look like and how they can be
decomposed into tasks, using the Threading Building Blocks (TBB) APIs as an
example. A lot of game developers write their own task scheduler, or they use
other available task schedulers. That's fine. The techniques in general are
applicable to those as well, although a lot of the pros and cons I'm going to be
talking about are specific to the work-stealing schemes that are present in
Cilk-style schedulers such as Threading Building Blocks.
I'm going to be showing code on the screen. No need to, you know, scramble to
take down notes. It's all posted here. You'll see that URL again.
Okay. So the easiest thing that you can do to parallelize work in a game --
really, any application -- is to find those big loops and break them down. TBB
makes this very simple. It has a construct called parallel_for. It looks like
this: you take your humble for loop -- you're iterating over an array, doing
something to every element -- and what you're going to do is embed
that for loop inside of a context, in this case a class, and you're going to
use the function operator to supply a range to iterate. So instead of iterating over
the entire range every time, you're going to iterate over a parameterized range.
And again, this is the TBB idiom -- this is one specific way of doing it, but the
general idea I think is translatable.
And then of course you're going to actually invoke the action of the loop in a
different way. In TBB it looks like this: you call parallel_for, you specify the total
range and the context object that you defined earlier, and internally TBB is going to
break down the different calls to that context object into separate
tasks, which hopefully will be mapped to separate threads. And it's
going to supply subranges of that total range. It looks -- I'm sorry, I hit some
terrible button. No, it's not a Mac.
I think I hit the panic button when -- oh, I didn't mean to show that slide -- and I just
unpanicked.
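For readers without the slides, here is a minimal sketch of the idiom being described, against the classic TBB interface of that era; the Particle type and UpdateParticles class are hypothetical stand-ins for the demo's code, and a live task scheduler is assumed:

```cpp
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

struct Particle { float x, y, vx, vy; };

// The "context" class: the loop body, callable over any subrange.
class UpdateParticles {
    Particle* particles;
    float dt;
public:
    UpdateParticles(Particle* p, float dt) : particles(p), dt(dt) {}

    // TBB invokes this operator with subranges of the total range,
    // potentially on several worker threads at once.
    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            particles[i].x += particles[i].vx * dt;
            particles[i].y += particles[i].vy * dt;
        }
    }
};

void update_all(Particle* particles, size_t count, float dt) {
    // Total range plus context object; TBB splits the range into
    // tasks and maps them onto its worker thread pool.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, count),
                      UpdateParticles(particles, dt));
}
```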
So when you do this, it looks like this -- this is the sort of before picture of where
we're at. This is a subsection of a screenshot of Intel Thread Profiler. We're
seeing a long stretch of computation at the top that I happen to know, because of
the demo application it came from, is our unparallelized particle system. After
this call to parallel_for it gets broken up into these chunks, which get mapped in this
case to seven other threads on this 8-thread -- well, 4-core Hyper-Threaded system.
So that's the primary benefit. You want to be doing this as much as possible
within your game. But of course many of the approaches to threading that are
used in games aren't as simple as "I need to go through this loop." There
are these other more general approaches, and for those you need a lower-level
way to decompose those approaches into tasks. Thankfully TBB has a low-level
API. The way the TBB API works is it defines multiple trees of work that can be
processed -- well, we know what trees are; it's
not quite as expressive as a DAG, but it's processed in the same fundamental way.
This list I have here is five examples of typical ways that games want to use
parallelism. I'll describe each of them in detail, and we're
going to look at how all of them can be broken down into tasks using the TBB
low-level API as an example.
So the first concept is a callback concept. This is very basic. It says I've got
some work. I'm going to give you a function pointer to that work. I want it to be
run when you can, soon, just not on me. You know, give it to somebody else. It
looks something like this. The concession we make here is that we're not actually
given any handle back to the result to wait on. It's the responsibility of
the callback itself to throw a flag or set some signal that's going to be checked in
our main loop that will indicate that this operation has completed.
I've got both code and tree diagrams on here for the visual and the
literal people, so maybe you'll find one that speaks to you. This is how you can
implement this in the TBB API. It starts by assuming an existing root somewhere
out there in the world that we're going to attach these other tasks to. The task is
created. When it's executed it's going to call that callback function. And then
you see the little cloud at the bottom there, which indicates that in a perfect world
calling that callback will produce more work that can be put into the tree -- there's
the "more" cloud. Because if all you have is a series of callbacks, this represents
a certain amount of overhead to sort of corral it into the TBB idiom. You're not
going to be gaining very much. You need to have those opportunities of doing
loop parallelization or other forms of data parallelization.
Finally the task itself is spawned. The root can be thought of here as a
continuation that will be spawned when all of its children are finished. But the
tree is structured such that that can never happen. Every time a new task is added, the child count of the root is incremented so it
never actually completes. It's there so that you can in aggregate clean up the
whole -- the whole stack of tasks, either at shutdown or if you need to do it
periodically per frame, et cetera.
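Reconstructing the slide rather than quoting it, a sketch of this callback pattern against the classic TBB task API might look like the following. Note that tbb::task was later removed in oneTBB, exact signatures varied across TBB versions, and all the function names here are mine.

    #include "tbb/task.h"

    typedef void (*Callback)(void*);

    // A task that runs one callback. The callback itself must set whatever
    // flag the main loop checks for completion.
    class CallbackTask : public tbb::task {
        Callback fn;
        void* arg;
    public:
        CallbackTask(Callback f, void* a) : fn(f), arg(a) {}
        tbb::task* execute() { fn(arg); return NULL; }
    };

    // The preexisting root: its reference count never reaches zero, so it
    // never completes, and acts as a handle for aggregate cleanup.
    static tbb::empty_task* g_root = NULL;

    void InitCallbackRoot() {
        g_root = new(tbb::task::allocate_root()) tbb::empty_task;
        g_root->set_ref_count(1);                 // never drops to zero
    }

    void RunAsync(Callback fn, void* arg) {
        g_root->increment_ref_count();            // one more child
        tbb::task& t = *new(g_root->allocate_child()) CallbackTask(fn, arg);
        tbb::task::spawn(t);                      // run soon, just not on me
    }

    void ShutdownCallbackRoot() {
        g_root->wait_for_all();                   // drain remaining children
        tbb::task::destroy(*g_root);
    }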
So these callbacks are simple. They're powerful but they have some limits. The
calling code is never waiting. Stuff's run on demand. But it's not waiting, so that
means that you never know when the callback is finished. The callback has to
declare that the work is completed. And of course there's no waiting, so what if
you're running on a one core system -- will your callback ever be run? No. You
need to do some special case stuff to make sure that that happens in the one
core case.
So this is a paradigm that, although it gets used, typically after the
developers have more experience with parallelism they tend to abandon it and
move on to a related pattern, which is promises. Promises are
essentially an evolution of callbacks that resolve some of those problems I just
talked about. Like callbacks they have function pointers that are going to be
executed on another thread, and execution begins right away. But unlike callbacks
you're given a handle to the result itself.
This is more or less modelled off of a similar pattern in the Java 1.5 specification
called futures. Maybe you all have experience with patterns and have, you know,
a proper name you can apply to this. I just make stuff up.
So the tree for that looks -- the tree and code looks something like this. Instead
of relying upon a preexisting root to the tree, we're going to create one for every
promise. And we're going to encapsulate that root inside another object. That's
really sort of an idiosyncrasy of the way these work trees are processed. We
encode it or we embed it inside another object because for a task in the work tree,
the memory is essentially reclaimed as soon as the execution is finished.
Since we want to have a valid object pointing to the result, we have to define that
ourselves and not rely upon a raw task, because a raw TBB task is not
guaranteed to live beyond execution. In fact here it's almost guaranteed not to
live. So in this case we embed the root inside the promise object. That will be
our return object. Once again, the task is made a child of this root and
spawned. And what we see is that this call to wait is doing basically a
double-checked test on the task itself. You can see that the first if block inside the
function is checking to see if the root has already been nulled. This is the signal
we use to indicate that the work is complete.
If it hasn't been nulled, it then takes a lock and checks again. This is -- I'm sorry.
There's a question?
>>: [inaudible].
>> Brad Werth: I'm sorry. I just can't hear what you're saying.
>>: So if you're going to [inaudible] this activity versus [inaudible] still tries to
hide the runtime, so I'm [inaudible] for all these patterns [inaudible] you actually
[inaudible] where you actually [inaudible].
>> Brad Werth: Okay. So the question was Cilk does this but it hides the
runtime, could these TBB examples be applied successfully to Cilk and what
would that look like, is the basic question? I don't have a whole lot of experience
programming to Cilk directly. The actual Cilk scheduler isn't generally
brought in whole cloth into a game. Usually people are copying the semantics
of the Cilk scheduler with their own APIs. So I just don't know.
My presumption is that any Cilk style scheduler, my definition of a Cilk style
scheduler is one that uses task stealing and has independent queues per thread.
Any Cilk style scheduler is not necessarily a work tree oriented scheduler. TBB
is. I'm not sure what Cilk was. Was Cilk doing work trees, was it doing graphs?
Neither?
>>: [inaudible] continuation [inaudible]. But my real question is like if you
[inaudible].
>> Brad Werth: Your question is if you just have the two primitives in Cilk which
are.
>>: Spawn and --
>> Brad Werth: Spawn and sync, can you do all of this, or do you need all this
extra structure of the work tree? My guess is that you probably can do it just with
spawn and sync. TBB is a freely available library. We put this out there to
developers saying, hey, you don't have to try to adapt Cilk into your game. You
don't have to try and write your own scheduler. You can just use the solution
we've got. Here's how you might do it. So it's a marriage of convenience in this
case.
Okay. So getting back to this, what you're seeing is that the wait on
the task is really a wait on the root, and there's a lock in there to allow multiple
threads to be able to wait on the same result. If your architecture's designed
such that that isn't necessary, then this lock also is not necessary. If the same
thread that dispatches the promise is always the only one that ever
waits on the result, you don't need this extra step. But you can see that the root is
waiting on this task to complete, and when it's done everything is cleaned up,
the pointer is set and all.
So promises are pretty great. The wait is only blocking if the result is not
available at the time that it's requested. If the wait does block, presuming that
there's more of that cloud of work that indicates a larger tree, the thread waiting
on the result will be able to jump in there and start
doing some of that computation, which will accelerate the completion of the result.
Implementing this on top of TBB is not a lot of code, and you could take it further
as is done in the Java 1.5 specification to allow cancellable jobs or partial
progress updates.
So this is a pattern that's finding some traction with game developers.
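A sketch of a promise along the lines just described, with the root embedded in the returned object and the double-checked wait. Again this is the classic TBB task API, the int result type and all names are illustrative, and a production version would want the root pointer to be atomic.

    #include "tbb/task.h"
    #include "tbb/spin_mutex.h"

    typedef int (*Work)(void*);

    class WorkTask : public tbb::task {
        Work fn; void* arg; int* out;
    public:
        WorkTask(Work f, void* a, int* o) : fn(f), arg(a), out(o) {}
        tbb::task* execute() { *out = fn(arg); return NULL; }
    };

    class Promise {
        tbb::empty_task* root;   // embedded root; nulled when work is done
        tbb::spin_mutex  lock;   // lets multiple threads wait on one result
        int              result;
    public:
        Promise(Work fn, void* arg) : root(NULL), result(0) {
            root = new(tbb::task::allocate_root()) tbb::empty_task;
            root->set_ref_count(2);   // one child + one so wait_for_all blocks
            tbb::task& t = *new(root->allocate_child())
                WorkTask(fn, arg, &result);
            tbb::task::spawn(t);      // execution begins right away
        }
        int Wait() {
            if (root != NULL) {                       // first, unlocked check
                tbb::spin_mutex::scoped_lock guard(lock);
                if (root != NULL) {                   // second, locked check
                    root->wait_for_all();             // may steal work and help
                    tbb::task::destroy(*root);
                    root = NULL;                      // signal completion
                }
            }
            return result;
        }
    };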
Okay. Synchronized calls. Variation on callback. The premise of a
synchronized call is that you want every thread in your invisible,
untouchable thread pool to do something right now before doing
anything else. It's sort of like a barrier or a fence. This is useful for the reasons I
put up there on the slide. But essentially you're initializing thread specific data or
cleaning up thread specific data.
This is absolutely trivial to do if your task scheduler allows you direct access to
threads. But not all of them do. Here's a way you can do it.
Again, code and a tree. We're creating a root. And we are going to create N
children of that root, which is equal to the number of threads in the thread pool.
What those children do we'll see on the next slide, but in the general scheme you
can see that they're going to call a callback and then they're going to test and wait
on, in this case, an atomic variable. We'll see what that looks like.
We spawn the root. All of the tasks go out there. They get mapped. Through
work stealing they get locked into exactly one task per thread in the thread pool,
because once a thread is executing one of these tasks, it can't do anything else.
It's given an atomic variable to fetch and decrement, and it's waiting for that
variable to be reduced to zero. Imagine, let's say, one thread in the thread
pool gets all of these tasks on its queue. It executes the first one and can't
finish. Meanwhile, the other threads in the thread pool are not being given other
work to do, so they're eventually going to need to steal work. They'll eventually
steal from this thread, pulling one of the uncompleted tasks off of its queue. That
basically ensures that every thread gets exactly one until they're all executed, and
then the busy loop finally falls out and you exit.
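A rough sketch of that mechanism, with classic TBB tasks and tbb::atomic (the naming is mine, and num_threads has to match the scheduler's actual pool size): each task runs the callback, decrements a shared counter, then busy-waits, which pins its thread and forces the remaining tasks to be stolen by the other threads.

    #include "tbb/task.h"
    #include "tbb/atomic.h"

    typedef void (*ThreadInit)(void);

    class SyncTask : public tbb::task {
        ThreadInit fn;
        tbb::atomic<int>* remaining;
    public:
        SyncTask(ThreadInit f, tbb::atomic<int>* r) : fn(f), remaining(r) {}
        tbb::task* execute() {
            fn();                         // e.g. set up thread specific data
            --(*remaining);               // atomic fetch-and-decrement
            while (*remaining > 0) { }    // busy-wait: pins this thread here
            return NULL;
        }
    };

    void RunOnAllPoolThreads(ThreadInit fn, int num_threads) {
        tbb::atomic<int> remaining;
        remaining = num_threads;
        tbb::empty_task* root =
            new(tbb::task::allocate_root()) tbb::empty_task;
        root->set_ref_count(num_threads + 1);
        for (int i = 0; i < num_threads; ++i)
            tbb::task::spawn(
                *new(root->allocate_child()) SyncTask(fn, &remaining));
        root->wait_for_all();    // returns once every thread has run fn
        tbb::task::destroy(*root);
    }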
>>: [inaudible] generating a lot more work. Is there a -- are you doomed to wait
for a long time?
>> Brad Werth: The question was, what if the background
task is generating a lot of work -- are you doomed to wait on it?
>>: [inaudible] something that you want that's, you know, kind of low priority and
-- but it has a lot of work, are you just going to wait until that [inaudible].
>> Brad Werth: The low priority example is actually the next pattern I get to.
This case is not the same as that. This is saying you know, I need to do
something either at the beginning of the initialization of the program to coordinate
with some middleware that needs some thread specific data or it's -- maybe it's
something you have to do every frame, although that can be quite inefficient for
the reasons I just described. But low priority operations are a different thing.
So this is useful. It gets about -- it solves the problem, but this is not an efficient
operation. If you're doing a synchronized call in the middle of a bunch of other
work, you've got to wait for all of the threads in the thread pool to flush and
complete all that other work until all of them sort of synchronize up on this one
operation and then they all can move on. You don't want to do this in the middle
of a frame.
That said, if you did it as the first bit of work at the beginning of every frame, if you
needed to, there's really no performance penalty for that.
Okay. Long, low priority operations. That was my interpretation of the question
you were just asking. These get used in games a fair amount. For several
different reasons. Loading, sound, activity, textures, AI path finding, et cetera.
The traditional method for this, even long before we had real parallel hardware
was to say, okay, this computation is going to go on a while, I'm going to get
results out of it every so often. I just need you to leave it running at some
priority, so I'm going to make a dedicated thread that does this. That works.
But almost by definition it introduces oversubscription into the system.
There was a way this was done before threading paradigms were broadly
available, and that is as time slices. A time sliced algorithm of course is one where
you define it such that you can run for a period of time, get your result and return
out, and then you could call it again next frame, run for a period of time, get your
partial result, et cetera, and keep doing that. If the same operations that today are
being applied to dedicated threads can be rethought as time sliced algorithms, then
you can do a task based approach to this. Which might look something like this.
Imagine if you will that these two work trees that you see on the bottom represent
on the left hand side an existing expanding tree of primary work that the
application is doing. The one on the right represents a special sort of dumping
ground that we use to put only low priority tasks on, and because of the
implementation details of the scheduler we happen to know that threads that are
stealing work are unlikely to go to that tree to pull down tasks.
What we can do is, as we are processing our main body of work, we
can check and see if it's time to kick off another low priority operation. Again, this
is done through an atomic variable, which will eventually be flagged once the
operation completes. This essentially ensures that there's only one request at any
given time in the system.
Then we can add that low priority task back to this preexisting root, spawn it,
and get on with what we were doing before. If we don't have any other work to
do -- if what is called the parent here on this diagram is actually a leaf -- you're
more or less guaranteed to proceed directly to the low priority task. Because
if your work queue is dried up and you just put a task attached somewhere else,
you're going to go do that one next.
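A sketch of that flow (all names and the time-sliced job are hypothetical, and the low priority root is created at startup the same way as the callback root earlier):

    #include "tbb/task.h"
    #include "tbb/atomic.h"

    // Hypothetical time-sliced job: does a bounded amount of work per call.
    void RunPathfindingSlice();

    static tbb::empty_task* g_lowPriRoot = NULL;  // the "dumping ground" tree
    static tbb::atomic<int> g_inFlight;           // one request at a time

    class LowPriTask : public tbb::task {
    public:
        tbb::task* execute() {
            RunPathfindingSlice();   // bounded: a task always runs to completion
            g_inFlight = 0;          // the task must not respawn itself;
            return NULL;             // the main loop queues the next slice
        }
    };

    // Called from within the main work, e.g. once per frame.
    void MaybeKickLowPriSlice() {
        if (g_inFlight.compare_and_swap(1, 0) == 0) {  // 0 -> 1 if free
            g_lowPriRoot->increment_ref_count();
            tbb::task::spawn(
                *new(g_lowPriRoot->allocate_child()) LowPriTask);
        }
    }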
So this works, I'm going to use finger quotes, air quotes around this, this works
but it's not fantastic. It doesn't give the level of control that most developers
want. There's some reasons why this is hard to set up. I sort of have two lines of
excuses and then two lines of explanations. My excuses are a task based
scheduler can't just naively -- well, first of all it always runs a task to completion.
So you can't say I'm going to run this for a while and when I get tired of it go do
something else. That's not what a task is.
Secondly, low priority tasks can't be rescheduled immediately following
themselves. You can't just say I'm done, let's do it again, let's do it again, because
that essentially ties up one thread in the thread pool always doing the same
thing.
As far as the priority issue, a lot of developers say I want to have real control
over all the different relative priorities of the work on my task queue, and this one
I'm just going to call low, and my expectation is that it will be run only when, you
know, there isn't other activity available.
The problem is the work stealing paradigm used by the Cilk style schedulers.
Imagine a scenario where you've got one thread that has maybe a normal
priority task and then some low priority work. It finishes that normal
priority task and it starts in on the low priority work. Elsewhere you've got
several other threads in the thread pool that have lots of high priority work,
but they're only doing some portion of it, some of the existing low priority
tasks. Should that first thread start looking before its queue is empty, looking at
all these other threads to say, do you have high priority work, do you have
high priority work?
That becomes -- that sort of loses the benefit of the work stealing scheduler
because now you're essentially synchronizing at the end of every task completion
saying okay, I need to go out and check with all of my other threads, do you have
high priority work, instead of just running to completion and then stealing
work. So there are some challenges with this approach. We think it's worth trying.
Directed acyclic graph. You may be wondering how a work tree can successfully
represent a directed acyclic graph. It's not too hard. This one doesn't have
code shown on it, but it's pretty easy to understand. Essentially imagine these
root-and-more combinations representing nodes in the original graph. You can see
that the nodes in the graph that have no predecessors can be spawned
immediately, and as they complete, they can notify nodes further on in the graph
that can then spawn themselves, which is kind of fun to say, spawn yourself.
It functionally performs as a directed acyclic graph with some modest amount of
overhead. Hopefully not too much, as long as all those "more" sections are
interesting. Directed acyclic graphs -- again, it's a technique that gets the job done.
One difference between your conception of a directed acyclic graph and this one
is that these trees, or these graphs, are actually destroyed by waiting on them;
they're not semi permanent constructions that you reuse from frame to frame. If
you want to do that, you could follow a similar approach that we used in the
promises pattern, and you could embed the individual nodes of the graph inside
other objects that are persistent. And of course your task scheduler might be
graph based to begin with, which would make this trivial. And here's the URL
for the code again, if you want to see some code.
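The slide for this one had no code, but the scheme, nodes spawning their successors as predecessor counts drain to zero, could be sketched like this (all types and names illustrative, classic TBB task API, with the same kind of never-completing root as before):

    #include "tbb/task.h"
    #include "tbb/atomic.h"
    #include <vector>

    struct DagNode {
        tbb::atomic<int>      predecessors;  // unfinished inputs
        std::vector<DagNode*> successors;
        void (*work)(void);
    };

    static tbb::empty_task* g_dagRoot = NULL;  // aggregate cleanup handle

    class DagTask : public tbb::task {
        DagNode* node;
    public:
        DagTask(DagNode* n) : node(n) {}
        tbb::task* execute() {
            node->work();
            // Notify nodes further on in the graph; any node whose last
            // predecessor just finished spawns itself.
            for (size_t i = 0; i < node->successors.size(); ++i) {
                DagNode* s = node->successors[i];
                if (--s->predecessors == 0) {
                    g_dagRoot->increment_ref_count();
                    tbb::task::spawn(
                        *new(g_dagRoot->allocate_child()) DagTask(s));
                }
            }
            return NULL;
        }
    };

    // Kick off every node with no predecessors; the rest follow on their own.
    void SpawnDag(std::vector<DagNode*>& nodes) {
        for (size_t i = 0; i < nodes.size(); ++i)
            if (nodes[i]->predecessors == 0) {
                g_dagRoot->increment_ref_count();
                tbb::task::spawn(
                    *new(g_dagRoot->allocate_child()) DagTask(nodes[i]));
            }
    }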
So happy ending. We had our particle system. We're more confident with that
now. We know, okay, if we've got a big loop like that, we can break that into
tasks using automatic methods like parallel for. It's all going to go into the task
scheduler controlled thread pool. That sounds good to me. Likewise, if we have
arbitrary asynchronous work we can dispatch it into the same task scheduler and
therefore the same thread pool using callbacks or promises.
If our physics middleware can dispatch work to us, as long as our threads have
been properly prepared -- and I know this sounds like a contrived example, but the
reason why I bring it up is this is exactly what Havok physics does. It attempts to
solve this problem by allowing you to indicate that your threads are Havok aware
threads and then it can start sending work directly into your system. And that
can be done with synchronized calls. Long low priority operations can be time
sliced, broken up into pieces, and put into the job system. Sound may be a poor
example here, because one of the reasons why sound is often on a dedicated
thread is so that it can be kept at very high priority. So if I had been thinking a
little bit more, I would have changed that sound thread to level
loading or something, just to give you a sense of what a truly low priority
operation would be. And of course random kernels within the game such as a
directed acyclic graph approach for doing bone animation, those tasks can be
shunted off into the same task scheduler and therefore the same thread pool.
So this is the message we try and leave with developers when we talk to them.
Task based parallelism is going to scale better on the different architectures that
game developers are targeting. You can break loops into tasks for the maximum
benefit, but then you can also use tasks to implement all these other sort of one
off approaches that are typically used in games. And that is my final slide.
>>: How would you use the system to synchronize access to different data
structures? [inaudible].
>> Brad Werth: So the question was how does this -- how does this deal with the
issue of synchronizing access to data? The answer is it doesn't. This sort of
presumes that there's a parallel approach already available or already defined by
the developer, where they've introduced the locks for the data copying and
various other things that are necessary. But their challenge now is integrating
different regions of code that use these techniques to use the same thread resources
effectively. We've done some other talks on that sort of thing. We do it actually
at GDC every year as a day long tutorial. That's just a different subject area.
And it's obviously very tricky.
But what this presentation is pointing out is that even once you solve that problem,
there's a second order problem that needs addressing.
>>: Any thoughts on the [inaudible].
>> Brad Werth: The question is about hyper objects and Cilk. Again, you're
presuming that I have knowledge of Cilk that I don't have. I'm not sure what the
hyper objects do. Let's talk afterwards and you can educate me.
>>: Did you experience like effects of cache locality issues? So if you do a lot
of different things in the same task scheduler, are you running a risk that, you
know, data has to [inaudible].
>> Brad Werth: The question is does this exacerbate cache locality problems by
doing lots of different work in the same threads. My answer to that would be it
may but no worse than having oversubscribed threads that are going to be
swapped in and out at the OS level, which are then going to have the same
cache flushing behaviors. It's no worse, as best we can tell.
>>: So a lot of this is [inaudible] synchronization aspect. And did you see any
problems with making sure that the access to data is [inaudible] you know with
dependencies between tasks, how you [inaudible] like the game logic may
depend on results from a collision detect which may depend on results
[inaudible].
>> Brad Werth: Yes. The question is again about what about data
dependencies and how the tasks can help enforce that, maybe. The reality is
that that is a problem that has to be solved before getting to this point. The best
ideas we've had on this are typified in a program we put together called Smoke,
which was a very scalable game-like architecture for, at the time, 8 core hardware,
which we feel can adapt arbitrarily. The approach taken there, to summarize it
very briefly is that instead of saying that there is an object which is copied to
different subsystems, physics, AI, you know, in this case procedural fire and
various other things, instead of saying there is one object that has to be
duplicated across all these systems in order to run concurrently, we said let's take
the components of that object and give ownership to each subsystem that has
write control, as in I can write the data to that subset. So your physics system
has a specific orientation, your AI system has the goals and orientation and
everything else.
And so there is no single object anymore, there are only these components, and
the individual subsystems are communicating with each other via a publish and
subscribe method that happens once at the beginning of every frame to get
updates, with read only access to things that they care about for the duration of the
frame. It seems to work okay. But again it's very data heavy and that is a big
deal on consoles, a big no-no. So console developers often need to take other
shortcuts to solve specific problems, these one off problems as opposed to a
comprehensive approach like I just described.
>> Jim Larus: Thank you very much.
>> Brad Werth: Thank you.
[applause].
>> David Stern: My name is David Stern. I'm from Microsoft Research in
Cambridge, and I'm going to talk about a bunch of different applications of
parallelism in machine learning. So for the first kind of third of the talk, I'm going
to talk about a fun application which is Computer Go. And hopefully tie that in
with the rest of the talk, which is on parallel message passing, message passing
is a way of doing efficient inference in graphical models.
So this is a screen shot from a video game we're actually working on at the
moment. It's an Xbox Go game. What Go is, is it's an ancient Chinese game, about
4,000 years old. It's currently played by tens of millions of people across the world,
mostly in Asia. Standard two players, black and white, and they use a board
which is a 19 by 19 grid. And they take turns to place stones on the vertices of
this grid. And once a stone is placed it is not moved but it can be captured and if
you have a chain of stones on the board which are connected together then you
capture them all together. And the aim of the game is to make territory on the
board. So here white has the territory on the left of the board and black has the
territory on the right.
And the reason that Go is interesting to researchers is because it's difficult to
produce a strong Go player. So you know, Chess was to all intents and
purposes defeated by computers back in '97 when Kasparov was beaten. But
still the best Go programs can't beat strong amateur Go players. And the reason
for this is twofold. So in order to do the brute force search which is so successful
for Chess, you need to do a look ahead and then evaluate the resulting positions.
And the branching factor, the number of legal moves in Go is much higher than
Chess, so that makes the look ahead less efficient. But more importantly
evaluation is much more complex, so the evaluation of positions in Go is much
more complex because all of the pieces that each player has are the same, you
can't just do things like use the inherent point value of the pieces to estimate the
value of each player's position.
So back in about 1995, someone proposed a solution to the problem of the
difficulty of evaluating Go positions. And that was this -- if you have a given
position you could stochastically estimate its value by actually simulating a
random game forward from that position until the very end of the game. And at
the very end of the game you can score -- you know who owns which part of the
board, so you can -- you know who won and you can use that information to
evaluate your initial position.
So that was just sort of a thing of interest to researchers. But the big
breakthrough that was made a couple of years ago in computer Go, about 2006,
was that you could actually bootstrap the policy you used to play these stochastic
games towards stronger and stronger games based on what you've seen in
previous games. And actually produce much stronger play. So this diagram kind
of summarizes the situation after you've played three of these random games,
which we call rollouts. So each circle in this diagram is a board position and each
arrow is a move, taking the game from one board position to the next. And these
squares just
indicate the outcome of that rollout.
So here the rollout on the left was a loss, and the other two were wins. And as
you play more random games, you start to see positions high up in this tree
more than once. And you play more random games and you can see that you
build up a tree. And the idea is to store this tree in memory, or at least part of it in
memory, and based on the statistics of outcomes that you've seen from these
games that have been played that have passed through these positions, you can
decide how valuable these positions are and bootstrap the games towards
stronger games.
So for example, the node that's highlighted here has been seen three times, so
it's been seen in these three rollouts. And two of them were wins. So we know
something about how valuable that position is. And as we play more rollouts we
bootstrap the policy towards playing stronger games.
>>: [inaudible] down that graph?
>> David Stern: That's correct. Yes.
>>: Random moves or are you just using a weaker player?
>> David Stern: So it starts -- so you can do it with -- you can start out with
completely random moves. We actually use some pattern matching to make them
slightly non random. And then as you play more, you're becoming
less random because once you learn which positions are good then you tend to
focus more on playing moves leading to stronger positions.
>>: To generate some random moves for the Monte Carlo part, for the parts that
you haven't seen before?
>> David Stern: Correct.
>>: So as I said, are you doing that completely randomly, or couldn't you just
plug in a weaker Go [inaudible].
>> David Stern: This is a whole subject of research, what makes a good Monte
Carlo rollout. It turns out that it's not necessarily true that stronger play
corresponds to a good rollout. What you want is something that gives you a
calibrated evaluation which might not necessarily be the same thing.
>>: [inaudible] to the same board position?
>> David Stern: Yes. That will happen occasionally. That's called a
transposition. Because this algorithm which is used to decide which moves to
play tends to explore a very small focused part of the tree. It's not actually a very
important issue for this type of search.
So the algorithm which is actually used to decide which move to make at each
point is called UCT, which stands for upper confidence bounds applied to trees. And
it has a convergence guarantee. So if you have an infinitely fast computer you will
play perfect Go. Obviously you don't, but the more -- much more important thing
is you get a smooth improvement in playing strength the more computer time you
give the algorithm.
And you can stop at any point when you run out of computer time and go with
your current estimate.
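The talk doesn't spell the rule out, but the standard UCB1-style selection that UCT uses balances observed win rate against an exploration bonus. A small illustrative sketch, not the speaker's code:

    #include <cmath>
    #include <vector>

    // One node of the stored search tree: outcome statistics per position.
    struct Node {
        int wins;
        int visits;
        std::vector<Node*> children;
    };

    // Exploit children with high win rates, but add a bonus that grows for
    // rarely visited children so every move keeps being explored.
    Node* SelectChild(const Node& n) {
        Node* best = 0;
        double bestScore = -1.0;
        for (size_t i = 0; i < n.children.size(); ++i) {
            Node* c = n.children[i];
            double score = (c->visits == 0)
                ? 1e9  // always try unvisited moves first
                : (double)c->wins / c->visits
                  + std::sqrt(2.0 * std::log((double)n.visits) / c->visits);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }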
>>: [inaudible].
>> David Stern: This is simulations per decision you're wanting to make.
>>: [inaudible].
>> David Stern: So this is -- so this is -- I'm in a [inaudible] position. I'm trying to
decide what move to make. This is the number of simulations I make.
>>: [inaudible].
>> David Stern: So it's potentially a huge amount of work here.
>>: [inaudible].
>> David Stern: In a game, on a small size board maybe sort of 60 or
something like that; on the full size boards about a hundred -- 230 on average,
something like that. So let's just show the win rate against some fixed opponent
as you increase computation or computation effort.
Okay. This is something which obviously lends itself to some parallelism. You
can simply run these rollouts in parallel. And in some cases this is just pretty
much for free. You have two completely independent computations. The only
issue is that when you're updating your -- what you're storing about these
positions you have a conflict in the root node here because you just want to
make sure you don't write at the same time, so you have to use some sort of
locking mechanism to prevent -- to prevent you writing something invalid.
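In code, the locking mentioned here only needs to cover the statistics update at the end of each rollout; a minimal illustration (C++11, names mine):

    #include <mutex>

    // Shared per-position statistics, guarded so two rollouts finishing at
    // the same time can't interleave their updates.
    struct NodeStats {
        int wins;
        int visits;
        std::mutex lock;
        void Record(bool won) {
            std::lock_guard<std::mutex> guard(lock);
            visits += 1;
            if (won) wins += 1;
        }
    };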
More interesting is when your rollouts that you're doing in parallel tend to sort of
explore the same part of the tree. Then you might have more conflicts. But more
importantly you're actually making the decision about what move to make
possibly based on stale information, because if you're doing these two
rollouts at the same time, you're obviously being less information
efficient than if you were to do them sequentially, because you can't use the
result of the previous one in order to inform you about what's the best move to
make in the second one.
So there is something kind of interesting going on here, which is that we have an
algorithm where you can change the order in which you perform the computations
and it doesn't break. It becomes less efficient, but it doesn't break. And it
turns out that the speedup advantage of being able to parallelize it outweighs
this inefficient use of information. And I think this sort of -- this is going to be a
sort of general theme of this talk.
And just to prove that it does help, this just shows the number of processors
against win rate against a fixed opponent, for a fixed one second of thinking time
to make a move. Okay. So that's the section on --
>>: [inaudible].
>> David Stern: That one was on a PC. But so on the Xbox you just have three
cores.
Okay. So now I'm going to talk about message passing on factor graphs which is
the general framework for doing inference in large scale probabilistic models. So
a factor graph is a bipartite graph which represents the factorization structure
of a function. It has two types of node. You have square nodes which represent
factors in the function and you have circle nodes which represent variables in the
function. And the edges of the graph show on which variables each factor
depends.
And the type of question we want to answer with factor graphs of probabilistic
models is, what are the marginals of the function -- that is, the value with some of
the variables summed out? So just to give an example, take a simple case
where we imagine we have some data which is given by Y, and we have some
parameter S, and we have some model which we believe explains our data, P of
Y given S and we have some prior distribution on the value of the parameter S.
You can represent that by this factor graph, and the product of these factors
corresponds to Bayes' law, which we need to compute in
order to determine the posterior distribution of S.
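Written out, the product of the two factors in that little graph is just Bayes' law (my notation, with s the parameter and y the data):

    p(s \mid y) = \frac{p(y \mid s)\, p(s)}{\int p(y \mid s')\, p(s')\, ds'}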
Now the reason that factor graphs are useful is they allow us to express more
complex models yet still perform inference tractably. So imagine in order to
explain our data now we believe that it's easier to do that if we break down our
model into a set of parts and introduce some intermediate variables.
So for example, we might think there's some variables which we call T, which
depend on the S variables, our parameters. There's some other
variable called D which depends on our T variables, according to the statistical
relationship P of D given T1 and T2. And finally our data depends on the D
variable using P of Y given D.
But we're not actually interested in the value of any of these intermediate latent
variables, the D and the T variables. All we care about is the posterior
distribution of our S variables. So in order to get the distribution we want, we
have to sum out the values of the variables we're not interested in. So that's the
computation we have to perform. And message passing is simply a term used
to refer to the fact that this computation can be broken down into local
components in this graph. So the arrows in the graph on the right
correspond to parts of the equation on the left. And at the end of the day,
once we've performed message passing, then we calculate the marginal
distributions of the parameters we're interested in by multiplying all of the
incoming messages into that variable.
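For the example just described, the marginal that the messages compute can be written as nested sums, one per latent variable; each inner sum is a local computation, i.e. a message (my notation, assuming discrete variables):

    p(s_1 \mid y) \propto p(s_1) \sum_{t_1} p(t_1 \mid s_1)
        \sum_{t_2,\, d} \Big[ \sum_{s_2} p(s_2)\, p(t_2 \mid s_2) \Big]
        \, p(d \mid t_1, t_2)\, p(y \mid d)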
So I mean there's no time to give it a proper tutorial on this, but I think the key
message is that factor graphs reveal computation structure based on statistical
dependencies and that the messages are results of parts of this computation
which can be localized. And we're going to talk about how this means they can
be parallelized. And one thing that we make a lot of use of in this work is Infer.NET,
which is a library that's been built at the lab in Cambridge for performing these
message computations.
So now I'm going to talk a bit about ways of parallelizing message passing in
general. So now let's imagine we have another situation. We've got a set of
variables S1 to S6. We have independent prior distributions on these variables
represented by those factors. And we have some data which are the Y variables.
And each of these data points depend on some of these parameters. So for
example, Y1 depends on S1 and S3 according to the factor P of Y1 given S1 and S3.
And okay. So an obvious thing we could do would just be to split the data into
two parallel streams and process the messages
corresponding to those data streams separately, and sometimes that will work
very well. And other times we may get conflicts and in many cases we can just
deal with this using standard locking types of techniques.
But sometimes we have models which are very dense and there's some
variables on which all of the data will depend and parallelization will be very
inefficient.
So the way that we deal with that is to actually duplicate the shared variables, the
latent variables of the model a number of times, depending on how many parallel
data streams we want to process and divide the data up between those parallel
streams and we can do this -- we may be able to do this in such a way that we
can reduce conflicts a lot. So now we can completely independently process
message passing in this part of the graph sequentially, in this part of the graph
sequentially, and then at the end do a reduce operation to multiply all of the
messages from these copies of the models together to get the final result.
So there are sort of two overall methods. There is the locking method, where you
just run the message updates in parallel and you use locks to make
sure you avoid conflicts. Or you can use this cloning
method where you duplicate the model in order to avoid conflicts. And this gives
us better scalability and it's easier to work across machine boundaries. But it
may lead to slower convergence and uses more memory.
So the best example application of this cloning technique -- yes?
>>: [inaudible].
>> David Stern: Yes. So actually this is sort of a slightly old slide. So there is a
-- so often we want to just pass through the data a single time. If we pass
through the data a single time and you just combine all the results at the end,
then you will get a worse approximation by splitting it up in this way because
you're making less efficient use of information. But actually in the example I'm
just about to give, you iterate the whole thing, so you do message passing here,
message passing there, combine the results, then farm out those results and then
iterate, and then actually you avoid any additional approximation. But you just have
slightly slower convergence which is offset by the fact you're running them in
parallel.
So this is an application in biology. The model is a very simple factor analysis
model. So you have quite high dimensional data, 50,000 dimensional data, which
is represented by this blue vector, and you assume that it's generated by
multiplying some relatively small, say 10 or 20 dimensional, vector of factor
activations with a mixing matrix. And this model, in order to run in Infer.NET, would
require about 22 gigabytes of memory and several days to run. So I mean it
could be done, but it's not ideal if you want to do lots of experimentation.
So the solution was to parallelize this using the cloning method, so you split the
data into a number of separate chunks, you duplicate the model for each of those
chunks of the data and then you can do your message passing independently on
those chunks, reduce -- multiply all of these messages to get the master copy of
the marginals of the W matrix, then distribute out the results of that and you can
iterate until convergence. We ran it on an 8 node, 64 core cluster, and then we
could run the model in two hours using about three gigs per machine.
And the way that this was implemented was using managed code, using F# and
MPI.NET, which is the managed [inaudible] for Microsoft MPI, running on Windows
HPC. And these two aren't so interesting, but I think the key point here is that
there's only one MPI command which is needed to actually parallelize this
algorithm, and that's this all reduce command, which takes a combining operation --
here it's doing an elementwise multiplication of arrays -- and then, from each
processor's point of view, it's multiplying in that processor's contribution, which in
this case is the messages from its chunk, and then
farming out the result to all of the processors.
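The F# and MPI.NET code itself isn't reproduced in the transcript; as a sketch, the single collective being described corresponds to an all-reduce with an elementwise product, shown here against the C MPI API and assuming the messages are encoded as an array of doubles:

    #include <mpi.h>
    #include <vector>

    // Each process multiplies in its chunk's messages elementwise, and every
    // process receives the combined result (MPI_Init is assumed elsewhere).
    void CombineMessages(std::vector<double>& localMessages) {
        std::vector<double> combined(localMessages.size());
        MPI_Allreduce(&localMessages[0], &combined[0],
                      (int)localMessages.size(),
                      MPI_DOUBLE, MPI_PROD, MPI_COMM_WORLD);
        localMessages = combined;  // farm the result back out locally
    }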
And the next example is a recommender system called Matchbox, which we've
developed in our group. The idea here is we want to be able to do large scale
personal recommendations to users of a Web service, so different types of things:
items, services, Web pages. And we want to be able to do this based on
feedback information that users have given about items, so they might like some
items, not like other items. And maybe we get the feedback implicitly from the
things they click on on Web pages. That type of thing. So the model for this is
actually fairly similar to the factor analysis model. Here we assume that for each
user we have a set of features. So for example, they're male and they're
British, and each of those features is associated with some latent weights, which
here I call U11 and U21. And by adding those weights together we generate
something which we call a trait for that user.
And we can repeat this structure a number of times to generate a set of traits for
the users. So we have a set of linear models for the user which gives us the
vector of traits for that user. And we could do the same thing for items. So for a
camera we have a set of linear models which combine the trait contributions
from each of those features to generate a vector of traits.
So we have a vector of traits for users and a vector of traits for items.
And the model is that the rating, or the value of this item for this user, is given by
the inner product between those vectors.
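In symbols (my notation, not the slide's): with x and z the user and item feature vectors and U and V the latent weight matrices, each trait is linear in the features, and the predicted rating is the inner product of the two trait vectors:

    u_k = \sum_i U_{ki}\, x_i, \qquad v_k = \sum_j V_{kj}\, z_j, \qquad
    r = \sum_k u_k\, v_k = \mathbf{u}^\top \mathbf{v}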
And if we have prior distributions like here we use Gaussian prior distributions on
the values of all of these latent variables, the U variables and the V
variables, then we can make predictions about what items the user will like using
message passing and if we have data we can update our beliefs in the light of
that data again using message passing.
So we run this on a couple of standard data sets. And it achieves some pretty
good performance. I think the most important thing for this talk is that again we
can -- we can use the same method to parallelize it. We get it to be about four
times faster on 8 cores, which --
>>: [inaudible] traits of the person and the traits of the item, it's not a function of
similarity between other people who have rated similar things.
>> David Stern: So it learns a mapping of the users and items into a
latent trait space based on what it's seen in the past. And that mapping is given
by this set of linear models for each user.
>>: A small number of parameters? So it's not a lot of parameters?
>> David Stern: The number of parameters will be the number of features
times the dimensionality of the trait vector. So you might potentially have many
millions of features, because one feature which is actually included will be the
identity of the user. So that will be a vector with a one in one place and zero
everyplace else.
>>: [inaudible] everybody simultaneously.
>> David Stern: Yes. And I think the main advantage of actually parallelizing the
training of this model for us is actually giving us increased agility for
experimentation. It's very useful to be able to run lots of experiments with this
type of model, experiment with different features and then try something else, and
being able to run it in two hours rather than eight hours makes a big difference
there.
The final example is the one which is sort of potentially the most interesting for
business, and that's AdPredictor. AdPredictor is a method of predicting the
probability of click on ads in paid search. So this shows a result page for Live
Search and these are the ads. And the advertisers have bid for this particular
keyword which has been searched for, Seattle, in order to try and participate on
this page.
And when we place the ads on the page, we place them according to the
expected revenue if a user was to click on that ad. And to get that expected
revenue, you have to multiply that bid, which is this, by the probability of click.
So you need a way of estimating the probability that a user will click
on this ad. So we display according to expected revenue. And we also have to
use these estimated probabilities of click for charging: we charge the
advertiser such that they would just maintain their order in this display ranking. It's
called a second price auction.
So there are advantages to improving our estimates of probability of click. We can
increase user satisfaction by targeting better, charge advertisers more fairly, and
increase revenue by actually showing ads which are going to get clicked on
more. And we have to do it at a rate of around 3,000 impressions per second at
the moment.
So basically the faster we can do this, the more data we can use and
the better predictions we can make. The model is actually very straightforward.
It's just a linear model based on a set of features. So for example you might
have the IP of the client, or a match type, which is a feature which takes into
account how well the query matches the key words. And to get the estimate, it's a
linear combination of the values for the active features contributing to the
probability of click.
So the model is a linear model with a set of weights, which I call W1 and W2, for
the active features. You have Gaussian priors on the values of these weights, and
once you've summed up the values of the weights which are present
for this user on this Web page, you then push the sum through a probit function
which converts it from being a number on the real line to being in [0, 1], so it's a
probability. And we can use message passing to update our parameters in the
light of the observations of clicks or no clicks. And it gives calibrated predictions,
which is important. So if you look at a bucket of
impressions for which we predict a particular probability of click, the actual
fraction of those impressions which were clicked on corresponds to the
predicted probability of click.
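As an equation (my notation): with x the binary vector of active features, w their weights, and Phi the standard Gaussian cumulative distribution function, the prediction being described is

    p(\text{click} \mid \mathbf{x}) = \Phi\!\left( \frac{\mathbf{w}^\top \mathbf{x}}{\beta} \right)

where the scale beta comes from the published AdPredictor model; the talk itself only mentions pushing the summed weights through the probit.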
And again, we use this cloning method of parallelization. And here is where the
accuracy question comes in, because with AdPredictor we need to do it in
a single pass through the data so it can be trained online. So as you increase
the number of processors, you slightly reduce the accuracy, but that will be
outweighed by the extra amount of training data you can use, and with
8 cores you can get about a 4 times speedup again.
So that's the rest of -- that's all of the applications. I presented some work on
parallel Monte Carlo Go. Monte Carlo Go gives stronger play, the more
computer time is available. So parallelism can be easily exploited there.
And I talked about a number of applications of parallel message passing on
factor graphs where messages are subcomputations of inference which can be
distributed. I talked about two methods for actually parallelizing, locking and
cloning. And applications in biology and online advertising.
And I think the sort of summary message of the talk is that these methods for
doing inference on probabilistic models, both Monte Carlo methods and
message passing, are robust to changing the order of computations.
So if you have multiple cores, the optimal schedule for computation might
actually be different to the optimal schedule if you have just a single core, and
everything still works. So that's it. Does anyone have any questions?
>> Jim Larus: Let's thank the speaker.
[applause]