
>>: All right. Good afternoon. Welcome, everybody, thanks for joining us here. I
am not Dan Fay. Dan is the host of today's lecture and he had to run away for a
family situation, not an emergency, but had to take care of some family business.
My name is Alex Wade. I am here with External Research and it is my pleasure
to welcome today David Anderson.
David received his PhD in Computer Science from the University of Wisconsin in
He's taught in the computer science department at U.C. Berkeley, worked at
several startups, and then has been back at U.C. Berkeley as a research
scientist since then. His work focuses on citizen cyber science,
using the internet to involve the global public in scientific research. And he's
currently lead of the BOINC Project, the Berkeley Open Infrastructure for
Network Computing, which develops widely used middleware for volunteer
computing. And he's also involved in creating new technology for distributed
thinking and web-based education.
I'd like you to join me all in welcoming David Anderson to Microsoft Research
today.
(applause)
>> Dr. David Anderson: Thanks very much, Alex. So today I'm going to talk
about the field of Volunteer Computing, where we're at today, where we're going
in the next couple of years. And at the end I'll say a couple words about my new
interest in involving people themselves, rather than their computers, to do
science.
This slide shows the history of volunteer computing. Volunteer computing is a
form of scientific computing where the resources, the processing power, are
volunteered by computer owners. And the projects that got this started in 1996
were kind of proof-of-concept things, looking for prime numbers and breaking
cryptosystems.
SETI@home and Folding@home were the first ones that were doing kind of new
science or real science and they also added graphics to show people what was
going on in their computer and that seemed to make a big difference. They got
really big.
These early projects all developed their own infrastructure software, the software
that manages the distribution of jobs and acts like a screen saver and things like
that. And that turned out to actually be a lot of work, so a bunch of groups came
up with the idea of making platforms to facilitate this sort of thing.
The first several of these were commercial. People tried to figure out how to
make money off of volunteer computing. None of them succeeded. In 2002, I
started to work on the BOINC Project, which was an open-source middleware
platform for volunteer computing and that led to a proliferation of volunteer
computing projects starting in about 2004.
So currently there are maybe 40 or 50 fairly large-scale volunteer computing
projects. It's still kind of dominated by SETI@home and Folding@home, but there
are a bunch of other fairly large ones these days.
These -- the applications that these projects do and the kind of science that
they're doing runs the gamut across pretty much all areas of computational
science. A lot of them are involved with computational biology, things involving
protein folding, which is sort of simulating how proteins develop out of gene
sequences or virtual drug design, which is figuring out how molecules bind with
proteins. A lot of this has applications to very practical problems like developing
vaccines or drugs for human diseases.
There are projects studying for example how malaria spreads, improving models
of the spread of diseases to figure out how we can spend prevention dollars
optimally. Some Volunteer Computing projects -- well, there's one from
CERN that has run simulations of the Large Hadron Collider, both the
accelerator part of it and the collisions in the detectors. There are some
projects that study climate change and global warming. A number of projects
involve different parts of astronomy: SETI, and Einstein@home, which is looking for
gravitational waves using the new LIGO detector.
Some projects do mathematics, various brute-force searches. Volunteer computing can also be
used not necessarily to do large-scale computing, but just as a way of deploying
a program on a lot of computers. So there's an interesting project called the
Quake-Catcher Network from Stanford, which uses the accelerometers in laptop
computers as pieces of a distributed seismograph, and you can actually detect
earthquakes earlier than you could otherwise because of the proximity of
computers to the earthquake.
And I should point out that almost all these applications were not developed for
volunteer computing. They were programs that the scientists were already
using and in some cases they had to do some work to get them to run on
consumer platforms. So, for example, the climate study projects had to take
these climate models, which are huge multi-million-line Fortran programs that had
previously only run on super computers and try to get them to compile for
Windows and that sometimes takes a certain amount of work.
Also, all of the applications are what are called bag-of-tasks applications. They
involve a bunch of independent jobs that don't communicate with each other.
There is also a possibility of running MPI type programs on a restricted set of
computers. Most of the applications are more compute intensive than data
intensive, though as consumer networks become faster that limitation is
disappearing.
So Volunteer Computing is a way to get a lot of computing cycles. The units
that people talk about computing power in these days are TeraFLOPS and
PetaFLOPS, which are thousands of TeraFLOPS. The first
computation to exceed the PetaFLOP barrier was Folding@home, which is a
volunteer computing project. That was reached last fall and it happened many
months before the first super computer achieved a PetaFLOP throughput.
As for the union of projects that use BOINC -- I should say that Folding@home is one of
the early projects that developed their own infrastructure software -- the totality of
BOINC-based projects exceeded the PetaFLOP barrier early this year, and it's
currently averaging 1.2 PetaFLOPS and I should probably make a couple of
comments here. The Folding@home computing power, the bulk of it currently
actually comes from Sony PlayStation 3s, the cell processor in that does about
100 GigaFLOPS so they get a lot of power from that. And recently they also
have developed a version of their app that runs on NVIDIA GPUs, and with a fairly
small number of computers that is now providing 40% of their total power.
BOINC is still almost entirely CPU based. There are about 570,000 computers
running BOINC around the world. The majority of those are Windows so the vast
majority of scientific computing done on Windows is done using volunteer
computing right now.
So the -- in addition to being a source of a lot of FLOPS, Volunteer Computing is
a really good deal. If you work out the expense of getting one TeraFLOP per
second of computing power over an entire year and you look at different
approaches the approximate numbers are like this. If you build a cluster yourself
and you buy the hardware and the networking equipment and you pay the
electricity bills and you pay system admins to keep the system running, it ends
up being about $124,000 per TeraFLOP-year.
Cloud computing is another way to get cycles. On the Amazon Elastic Compute
Cloud, that TeraFLOP-year would cost you about one and three quarter million dollars. If
you look at the 10 largest BOINC projects, their expenses are really dominated by
hiring a sysadmin to keep the server running. A BOINC project has a server,
which is typically a Linux box or a few Linux boxes. The hardware costs are
almost nothing, and the cost per TeraFLOP-year is about $2,000.
>> Question: How much electricity are the (inaudible) --
>> Dr. David Anderson: That's a difficult question to answer. In many cases
the computation's done while the computer is on anyway. Yeah. It will use about
30 more watts if it's doing a floating point intensive computation.
There's no doubt that a large amount of that electricity cost, which is lumped into
the cluster, is -- somebody's paying it, it's just not the scientists. It's being spread
out among a lot of computer owners.
So the real goals of Volunteer Computing, it's not to set records or to break
barriers, it's to facilitate new science and to in particular to allow science to be
done that wouldn't happen otherwise because people couldn't afford it. If you
look at the top 500 super computers, historically most of the top ones are owned
by defense labs and are used for nuclear weapons research.
Volunteer Computing is a way to get a lot of cycles if you're working in an area
that's underfunded, like you are doing SETI, or you are studying diseases like
malaria that don't have a lot of money behind them. Or if you are a scientist
working in a country that doesn't have much computing infrastructure or you're
doing science that's speculative and unpopular in the current political
environment of your area.
Volunteer Computing gives power to scientists who can explain their research to
the general public and convince the public that it's worth this investment in their
electrical bill to help the scientists. And conversely what we're trying to do is to
create an environment where computer owners have a wide choice of things that
they can participate in and they will try to make an informed decision by actually
going out and learning about the science that these various projects are doing
and that that process will increase their awareness of scientific research and their
interest in it.
So that's the idea. I wouldn't say we've made really good progress. The
numbers that we have right now, there's roughly a half million people
participating in Volunteer Computing, about a million computers, but that's only
maybe a tenth of a percent of the one billion or so internet-connected PCs right
now. And on the science side 50 or so projects is a tiny fraction of the scientists
who could potentially benefit from Volunteer Computing.
So let me say a little bit about the -- about reaching the ExaFLOP barrier, which
is a thousand PetaFLOPS. And this is something that if you think about building
an ExaFLOP super computer in the traditional sense of having a bunch of
hardware in a room, it's not really feasible, economically, for maybe 10 or 15
years; it's going to take too much power and too much money. But we could
potentially reach the ExaFLOP barrier in Volunteer Computing a lot sooner than
that. To start to think about this we need to consider not just CPUs, but other
kinds of computing devices. And I'm going to go through a few of these and talk
about their potential to supply computing power.
So for each of these different kinds of devices we need to think about its
performance and how fast the performance is going to grow over time, how easy
it is to program for scientists, the energy-efficiency issues, which will become
more important, and there are also pragmatic issues of deploying volunteer
computing on these different kinds of devices, some of which are controlled by
single companies.
So CPUs will continue to be an important part of the cycle pool for a while. Their
contribution will come more and more from multiple cores, and the tradeoff between
having an application that uses a lot of cores versus trying to run a bunch of separate
jobs on individual cores means that running parallel apps will become increasingly
important for using this power.
In addition, there are a lot of mechanisms being developed to save electricity, which
means computers will shut down or go into low power modes, and their availability,
the fraction of time that you can do computing on them, is going to shrink. I
mean, the model is definitely moving toward computing happening while the user's at the
computer, in some zero-priority mode that stays out of their way, as opposed to
the old model where when your screensaver kicks in, that's when the computing
starts.
Moving from the current million or so PCs to tens of millions of PCs is going
to take the help of somebody like Microsoft or a computer manufacturer or a
media company. Basically we currently have the market of computer enthusiasts
pretty much saturated with Volunteer Computing, but reaching sort of the
average person who uses the computer as an appliance, that's the hard part.
The next type of device, which is actually I think the most interesting, is GPUs.
This picture shows why GPUs have a performance advantage. They don't have
to devote a whole lot of transistors to caching and creating the illusion of random
access main memory running at processor speed.
So for example, the current NVIDIA chip does about 500 GigaFLOPS, half a
TeraFLOP; that's maybe 50 or 100 times faster than the CPU on a typical PC
that it's in. Programming GPUs to do scientific programming has become easier
recently with the introduction of CUDA, NVIDIA's C-based environment for GPU
programming, and Apple has announced something called OpenCL that seems
to have the same goals.
So we could get an ExaFLOP, 10 to the 18th floating point operations per second, if
we had four million GPUs, each doing a TeraFLOP, which is going to be the case a year
or two from now, and 25% availability, running a quarter of the time. That's an
ExaFLOP, and it could conceivably happen in three or four years.
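A rough sketch of that arithmetic, using just the figures quoted above:

```cpp
#include <cstdio>

int main() {
    // Back-of-the-envelope check of the ExaFLOP estimate from the talk:
    // 4 million GPUs, each sustaining ~1 TeraFLOP/s, available 25% of the time.
    const double num_gpus      = 4e6;    // hosts with a capable GPU
    const double flops_per_gpu = 1e12;   // ~1 TeraFLOP/s per GPU
    const double availability  = 0.25;   // computing a quarter of the time

    double sustained = num_gpus * flops_per_gpu * availability;
    printf("Sustained throughput: %.2e FLOP/s (1 ExaFLOP = 1e18)\n", sustained);
    return 0;
}
```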
Another type of resource is video game consoles, which are becoming faster as
people want to have more realistic games and so forth. These could potentially
provide nontrivial computing power, but they're inconvenient for a couple
reasons. They can be hard to program, in the case of the Sony PlayStation 3.
And in general they tend to have closed environments that are hard to get your
program deployed on. So my crude calculation is that they could potentially give
us a quarter of an ExaFLOP in a few years.
Another kind of device that people have been thinking about recently because of
energy efficiency issues is mobile devices, things like cell phones and PDAs,
media players, the Amazon Kindle. Currently these are sort of discrete
categories, but internally they're converging to the same sort of hardware and the
processors in these are designed for low energy consumption. So in terms of
FLOPS per watt, these are the best thing going.
I should say we are thinking about -- one can consider using these while they're
recharging. You wouldn't want to have your battery being zapped by running a
scientific program while the thing is in your pocket. The software environment is
problematic. Things like Google's Android, the proposed Open Source
environment for cell phones, is a possibility, though currently that requires doing
everything in Java.
And there are a lot of these: there are currently over three billion cell phones and it will go
up to five billion pretty soon. But they're so slow that actually even if you got all
of them it's a lot less computing than you could get out of GPUs. So it's an
important idea, but I think that GPUs are more important.
Similarly, appliances like home media players are moving to a full-featured PC
inside the box. Cable set-top boxes and Blu-ray players are typical of this
category. The software environment is converging to a Java based platform. So
again we have the problem of not being able to compile your Fortran program for
it. So they could potentially give us a fraction of an ExaFLOP.
Okay. So that's the summary of why I think Volunteer Computing is the quickest
path to reaching the ExaFLOP milestone.
Let me talk a little bit about the BOINC project, which I lead at Berkeley, which
essentially provides an operating system for doing volunteer computing. We run
a little tiny project, me and one and a half other programmers. Most of what we
do is developing technology. We write software. And to fill a vacuum we also
try to enable a variety of online communities related to volunteer computing.
So we run a bunch of e-mail lists and message boards for people who do things
like writing translations or providing customer service, customer support for
users, doing testing. The task of testing our software on all the popular platforms
in the world is way more than we can do ourselves. We have a lot of volunteers
to do it.
Let me just kind of quickly describe what BOINC is, what the software itself is,
and some of our development efforts these days. There's two halves to the
BOINC software, the server part and the client part. The server part consists of a
job processing mechanism and the key component here is the scheduler. It's
basically a batch queuing system that has to have extremely high capacity. Many
of these projects have hundreds of thousands of clients and they need to be able
to handle hundreds of requests per second and issue them a single job or maybe
several jobs.
So, out of a database that at a given point may have a million or so jobs in it, the
key aspect of the architecture is that the schedulers are insulated from the
database. There's a cache of jobs in shared memory, which is replenished by
a separate program, and the schedulers, instead of having to go to the database
and find jobs that are appropriate for that particular client, can just look in the
cache, and that works really well. The BOINC scheduler is able to dispatch on
the order of 10 million jobs a day, which is very important.
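A minimal sketch of that feeder/scheduler split, with hypothetical names and none of the real BOINC machinery; the point is just that schedulers take jobs from an in-memory cache that a separate program keeps full, instead of scanning the database:

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

// Hypothetical job record; the real BOINC workunit/result tables carry much more.
struct Job {
    long        id;
    std::string app_name;
    double      est_flops;   // estimated FLOPs to complete
};

// In-memory cache standing in for the shared-memory segment the feeder fills.
class JobCache {
public:
    // Feeder side: periodically refill the cache from the database,
    // so schedulers never have to scan the database themselves.
    void refill(const std::vector<Job>& from_db) {
        std::lock_guard<std::mutex> lock(m_);
        for (const auto& j : from_db) jobs_.push_back(j);
    }

    // Scheduler side: handle a client request by taking a job from the cache.
    std::optional<Job> dispatch_one(const std::string& client_app) {
        std::lock_guard<std::mutex> lock(m_);
        for (auto it = jobs_.begin(); it != jobs_.end(); ++it) {
            if (it->app_name == client_app) {   // crude suitability check
                Job j = *it;
                jobs_.erase(it);
                return j;
            }
        }
        return std::nullopt;  // nothing cached for this app; feeder will refill
    }

private:
    std::mutex m_;
    std::deque<Job> jobs_;
};

int main() {
    JobCache cache;
    cache.refill({{1, "einstein", 1e13}, {2, "climate", 5e14}});
    auto job = cache.dispatch_one("climate");   // fast path: no database query
    return job ? 0 : 1;
}
```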
The other part of the BOINC server software is a whole bunch of PHP code that
provides a website for the Volunteer Computing project. And this is real
important because one of the things that brings people in to volunteer their
computer and that keeps them going year after year are community -- well,
competition mechanisms and community features of various sorts. So people
like to keep track of how much work their computer has done, compete with other
people, form teams, compete among teams, talk with other people about the
science that's going on and all sorts of things. And these are really critical
functions.
We've tried to make it really easy to set up a computing project using BOINC so
you can either set up a Linux box and install the BOINC software on it and port
your application to our API, which is very easy. And in a day or so you can have
a project up and running.
Even easier than that, we've created a VMware virtual machine image that has
all the BOINC software already there and everything that it depends on, like the
proper versions of MySQL and PHP and so forth, and you can run that virtual
machine on any computer you want and get things going even faster.
If you want to avoid even worrying about hardware -- generally the server that
you run this stuff on, you want it to be highly available and scalable and so forth --
we're working on developing a virtual machine image for the Amazon Elastic
Compute Cloud so that you won't even have to worry about hardware anymore.
So we're trying to really reduce the barriers to entry for Volunteer Computing as
low as we possibly can.
Client software -- well, first of all it runs on all popular computing platforms. It's
really designed so that a nontechnical computer owner can install it with one
click, not do any configuration at all, and have it work indefinitely with no
intervention from the user.
There's also of course a big category of users who want to configure things and
want to have a lot of knobs to turn and so forth. The actual internal structure of
the software is a bit involved. It looks to the user like a single entity, but it's
made up of several programs; this picture shows what they are.
The central program is what we call the core client. It's in charge of doing all
network communications, talking to servers, getting jobs, downloading files. It's
in charge of doing CPU scheduling, deciding when to run applications.
The different GUI pieces -- well, there's what we call the BOINC Manager, which
shows you kind of a spreadsheet-style picture of what's going on, and if you want
there is also a screen saver; both of these can run application graphics. So an
application actually consists of two parts, the part
that does the scientific computing and if it wants it can have a separate piece that
does graphics and talks through shared memory so you can see the current state
of the computation.
As for the interfaces between these, the GUI controls the core client
through RPCs over a TCP connection, so in fact you can use the
GUI to control clients on remote hosts. Now the bigger picture: if you
install the BOINC client software on a PC initially it doesn't do anything. You
have to then do what we call attach the client to whatever set of projects you
want.
So like I say, there's on the order of like 50 projects out there and we want
people to go out and read their websites, learn the science that they're doing,
decide which ones they think are important and then you can attach your
computer to any subset of the projects. An attachment has an associated
weight, which says how much of your resources you want to go to the different
projects. So you could spend 80% of your time studying the climate and 20%
doing some sort of biomedical stuff or whatever you want.
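A minimal sketch of how those attachment weights might translate into a split of processing time; the project URLs and structures here are made up, and the real client's scheduling is considerably more elaborate:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical attachment record: a project URL and the volunteer's chosen weight.
struct Attachment {
    std::string project_url;
    double      resource_share;   // e.g. 80 and 20
};

int main() {
    std::vector<Attachment> attachments = {
        {"http://climate.example.org/", 80.0},
        {"http://biomed.example.org/",  20.0},
    };

    // Normalize the shares: each project gets a fraction of this computer's
    // time proportional to its weight.
    double total = 0;
    for (const auto& a : attachments) total += a.resource_share;

    for (const auto& a : attachments) {
        printf("%s gets %.0f%% of this computer's time\n",
               a.project_url.c_str(), 100.0 * a.resource_share / total);
    }
    return 0;
}
```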
The key idea is that these projects are completely independent of each other.
There's no central BOINC authority. There's not even an official listing of all
these projects anywhere. Each one has its own server. They're identified simply
by the URL of their website.
Now the goal of this model is to promote kind of an ecosystem where new projects
are constantly arising and disappearing and volunteers are constantly learning
about new projects and assessing their priorities. In practice this is kind of
difficult because the only way that people have for finding out about new projects
is by Googling or word of mouth or something like that.
We've developed a framework that supports intermediate websites called account
managers, so the idea is that an account manager provides sort of one-stop
shopping for Volunteer Computing. It's a website where you can go and look at
all the available projects, summarized in some way and their research described.
And you can attach to them just by clicking checkboxes, as opposed to having to go out and survey a whole bunch of separate websites.
To make this possible, the BOINC server software provides a set of web services
so that account managers can create accounts on the different projects and look
up stuff, manipulate stuff. There are currently two of these. One is called
GridRepublic, the other is called the BOINC Account Manager.
Some of the sort of technical work we've been doing recently -- it's become
clear that supporting GPUs is very important, and also that supporting multi-threaded
applications that can use multiple cores is important. We've had to kind of
revamp the internal architecture of BOINC. Originally the idea was that a client
talks to a server and it tells the server what its platform is or it can actually give it
a list of platforms so it may be able to run both Win 64 and Win 32 applications.
The server has a bunch of jobs. Each job is associated with an application, not
with a platform or version or anything like that. The server goes through and
finds the -- whatever it thinks that the best platform would be and sends the client
that particular application.
Well, with the advent of multi threaded and coprocessor applications we needed
to generalize this so in the current architecture a given platform, like Win 32, may
have a whole set of different application versions that run in that platform. So
there could be one which is optimized for sequential processing, one that is a
multi-core version, one that uses a CUDA GPU, maybe another one that uses both
multi-core and CUDA.
Now the problem, of course, is figuring out for a given client which of these
different alternate application versions is best. And instead of embodying that
intelligence in BOINC itself, we came up with an architecture where the project
supplies a sort of plug-in function that goes into the scheduler, which takes as
input a description of that particular host -- its CPUs, its co-processors,
everything else about its hardware -- and goes through these different available
versions and for each one decides how many cores will that application be able
to use in that machine? You know, how many co-processor instances? How
many flops does it expect the application to get running in that machine?
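A sketch of what such a project-supplied plug-in might look like; the type and field names here are hypothetical, not the actual BOINC scheduler interfaces:

```cpp
// Hypothetical descriptions of a host and of one application version.
#include <string>

struct HostInfo {
    int    cpu_cores;
    int    cuda_gpus;
    double cpu_flops_per_core;   // measured by benchmarks
    double gpu_flops;            // peak of the GPU, 0 if none
};

struct AppVersion {
    std::string plan_class;      // e.g. "sequential", "multicore", "cuda"
};

// What the scheduler needs back for each candidate version:
// how much of the host it can use, and the expected throughput.
struct VersionUsage {
    bool   usable;
    double cores_used;
    double gpus_used;
    double expected_flops;
};

// Project-supplied function: given the host and a version, estimate its usage.
VersionUsage score_version(const HostInfo& host, const AppVersion& av) {
    VersionUsage u{false, 0, 0, 0};
    if (av.plan_class == "cuda") {
        if (host.cuda_gpus == 0) return u;           // can't run here
        u = {true, 0.5, 1, 0.2 * host.gpu_flops};    // assume ~20% of GPU peak
    } else if (av.plan_class == "multicore") {
        u = {true, double(host.cpu_cores), 0,
             0.9 * host.cpu_cores * host.cpu_flops_per_core};
    } else {                                         // plain sequential version
        u = {true, 1, 0, host.cpu_flops_per_core};
    }
    return u;
}

int main() {
    HostInfo host{8, 1, 4e9, 5e11};   // 8 cores, one ~500 GigaFLOPS GPU
    AppVersion cuda{"cuda"};
    VersionUsage u = score_version(host, cuda);
    // The scheduler would call this for every version of the app, pick the one
    // with the highest expected_flops, and use that figure to estimate runtime.
    return u.usable ? 0 : 1;
}
```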
And then we can use that information first of all to pick whatever the best version
is. And secondly, to get an estimate for how long a job is going to take to run
there, which is critical to make a good scheduling decision.
Just kind of jumping around here, another thing we've been working on here is
improving our replication algorithms. So one interesting thing about Volunteer
Computing is you're using computers that are essentially anonymous and you
can't trust them. You can run a program on them, you get back an answer. You
can't even be sure that what you get back is a result of running your program.
There are a few bad apples, who will intentionally send you back wrong stuff or
output of a previous run of the program or something like that.
In addition, a lot of -- when you're dealing with this number of computers a lot of
them have hardware problems, especially those that are overclocked. They can
get floating point errors that don't crash the system, but they give you bad
answers for the scientific program.
So one way to deal with this is to do replicated computing: take the same job,
run it on two different computers, compare the answers and accept them only if they're
the same. It's actually trickier than that, because different computers
don't do floating point math the same way, and if you run the same program on an
Intel and an AMD processor you can get back wildly different answers, especially
for unstable computations.
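One common way to handle that is to compare results with a tolerance rather than bit-for-bit. This is a sketch that assumes results are just vectors of doubles, which is not how any particular project stores them; for truly unstable computations even this isn't enough, and a project may instead only compare results from numerically identical platforms.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Compare two replicas' outputs with a relative tolerance instead of exact
// equality, so legitimate Intel-vs-AMD floating point differences don't
// cause valid results to be rejected.
bool results_equivalent(const std::vector<double>& a,
                        const std::vector<double>& b,
                        double rel_tol = 1e-5) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) > rel_tol * std::max(scale, 1.0))
            return false;
    }
    return true;
}

int main() {
    std::vector<double> a{1.0000001, 2.0}, b{1.0000002, 2.0};
    return results_equivalent(a, b) ? 0 : 1;   // accepted despite tiny drift
}
```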
Anyway, so replication -- we've worked it out and it's a way to increase the trust in
your results to whatever level you want, but it wastes computing power. So one
thing we're working on right now is a more intelligent system where we do replication
only some of the time, and if we're sending jobs to a host that has built up trust,
then with a certain probability we don't replicate that particular job.
So the policy right now maintains an estimate of the error rate for each host. If
we're sending a job to a host whose error rate is above a threshold we always
replicate; otherwise we replicate with a probability proportional to the error rate.
So the idea is that the host earns a certain good reputation and after that we
don't always replicate, but every now and then we mix in replication. It's a little
bit unclear whether there are counter strategies that would defeat this scheme
and if anybody is into this kind of stuff, I would enjoy talking to you.
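A sketch of that decision rule, with hypothetical names; the real policy has more state behind it:

```cpp
#include <cstdio>
#include <random>

// Per-host state the server might maintain for adaptive replication.
struct HostStats {
    double error_rate;   // estimated fraction of bad results from this host
};

// Decide whether a job sent to this host should also be replicated elsewhere.
// Above the error threshold we always replicate; below it, we replicate with
// probability proportional to the estimated error rate, so even trusted hosts
// are still spot-checked occasionally.
bool should_replicate(const HostStats& host, double threshold, std::mt19937& rng) {
    if (host.error_rate >= threshold) return true;
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double p = host.error_rate / threshold;   // scales to [0,1) below threshold
    return unif(rng) < p;
}

int main() {
    std::mt19937 rng(1);
    HostStats trusted{0.001}, flaky{0.20};
    std::printf("trusted host: replicate? %d\n", should_replicate(trusted, 0.05, rng));
    std::printf("flaky host:   replicate? %d\n", should_replicate(flaky, 0.05, rng));
    return 0;
}
```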
BOINC involves some very intricate scheduling policies. There's actually two
interacting schedulers in BOINC, one that runs in the client and the other running
in the server. The client has to decide when to get new work and what project to
get it from and how much to ask for and at any given point it has to decide how to
schedule the CPUs among existing jobs.
This can be kind of tricky when you have computers that are disconnected from
the network some of the time. You need to prefetch enough work to keep them
busy while they're not connected. If you fetch too much work you can miss
deadlines, and in the presence of replication you can cause other replicas
to sort of get delayed for a long time until they can be validated.
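A very rough sketch of the work-fetch tradeoff the client faces; the parameters here are hypothetical and this is nothing like the actual client code:

```cpp
#include <cstdio>

// Decide how many seconds of work to ask a project for, given how long the
// computer is typically disconnected and how much work is already queued.
double work_to_request(double buffer_days,          // user's "connect interval"
                       double queued_cpu_seconds,   // work already on hand
                       double cpu_seconds_per_day)  // usable CPU time per day
{
    double target = buffer_days * cpu_seconds_per_day;  // enough to stay busy
    double deficit = target - queued_cpu_seconds;
    // Never over-fetch: too large a queue risks missing deadlines and
    // holding up other replicas of the same jobs on other hosts.
    return deficit > 0 ? deficit : 0;
}

int main() {
    // Example: 2-day buffer, 12 usable CPU-hours/day, 5 hours already queued.
    double req = work_to_request(2.0, 5 * 3600.0, 12 * 3600.0);
    printf("Request %.1f hours of work\n", req / 3600.0);
    return 0;
}
```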
So similarly there are a bunch of server scheduling problems that can be very hard.
Until now we've mostly experimented with scheduling policies by coming up with
something that sounded feasible and deploying it on a running project. The
problem with this is that if you -- if you make mistakes you can waste a lot of
computing time and get a lot of people angry at you. And the other problem is
that there's a lot of factors going on at once and you can't really be sure that the
result you see is because of the change that you made.
So we're developing two simulators, essentially one to study the client and one to
study the server. So for example the server simulator consists of a program that
emulates a huge number, like hundreds of thousands or millions of clients, and it
models things like their error rate, the churn rate, the process of people joining
and leaving the project, the distribution of speeds of the computers and so forth.
And that actually plugs into a real BOINC server; to maximize the accuracy of
the simulation we don't simulate the server, we use a real live server with the
actual programs and the database behind it, and that actually runs fast enough
that you can simulate at about 100 times real time.
We also do a lot of work on the community and competition, the volunteer facing
features of BOINC. And in the past year we've added a bunch of features that
sort of emulate what you might find in a typical social networking website:
the ability for people to make lists of friends and send messages to each other,
and a lot of stuff related to teams. Teams turn out to be a
surprisingly powerful mechanism to motivate people in Volunteer Computing and
we've sort of souped up the idea of teams so that there can be structure within a
team. You can have administrators, kind of lieutenants, as well as the master of
the team, the ability for a team to have its own private message board and things
like that.
We're also figuring out ways to introduce volunteer computing into the big social
networking sites, like having Facebook applications so that
people can see when somebody has joined a new volunteer computing project, or
when somebody's total amount of credit has passed a milestone, and that will show up
on their list of events. And to make it -- yeah?
>> Question: Can you describe how a team knows that it's beaten another
team?
>> Dr. David Anderson: Well, there are elaborate features for listing teams by
either their total credit or their recent average credit. You can filter teams by --
you know, certain countries, or by company teams and university teams.
You can group things in various ways. The other general mechanism is that, in
addition -- we don't really want people to think in terms of individual projects.
We want them to think of their totals across all of the BOINC projects, so we have a
fairly elaborate system where the projects export all of their credit statistics in
XML files, and these third-party websites aggregate that data and show various
forms of competition on leader boards, summed over all the BOINC projects.
So like I say, currently there are about 50 BOINC-based projects. This number is
embarrassingly small. I hoped that there would be a thousand by this point. One
reason for this is that even though we've made it super easy to create a BOINC
project, it's still beyond the capabilities of the average computational scientist.
People who do scientific computing are not computer whizzes, they're not
sysadmins. In many cases they're not actually programmers.
The idea of volunteer computing projects being operated by a single scientist or a
single research group -- I think we've sort of hit the limit of how far that model will
go, because there's not that many research groups that have the resources to do
something like this. There's a bunch of other organizational models. My favorite
is what I call the campus level meta-project.
So the idea -- and this is being deployed at the University of Houston, and hopefully
that will inspire other people to do it -- is a volunteer computing project that is operated
at the level of a university. It handles applications from all of the scientists at that
university. The servers and the website and so forth are operated by sort of a
central group. And it's promoted -- well, first of all, the computers of the university
itself, like the lab machines and so forth, would be configured to run that project to
support the university's own research. It would also be promoted to the university's
students; the University of Houston, for example, has 40,000 undergrad and grad
students, and hopefully a lot of those would want to support university research
on their machines.
It also has 400,000 alumni and this is the interesting number. Alumni have
school spirit that manifests itself in going to football games and sending in money
every year. It seems like it would be pretty straightforward to get these people to
run Volunteer Computing for the university on their computers, most of them own
computers. So I think that is kind of an ideal way to do things and I'm trying to
get some universities interested in that.
There are a few other organizational models. There's a project called
MindModeling.org, which essentially is centered around a particular application.
These are people who build giant Lisp programs that model the human brain and
all of its different components. There's one of these programs called
ACT-R, and there are a lot of researchers that use it, and that community has
come together to create a volunteer computing project that just runs that one
application, and any of those scientists can feed jobs into it.
IBM World Community Grid is another metaproject, or an umbrella project,
operated by IBM more or less as a PR exercise, and it hosts a number of
applications from a variety of universities all around the world. And there are a few
other things going on: in Spain there's a region called Extremadura where
essentially all the universities and research labs have decided to form a volunteer
computing project that spans all of them.
Okay. Let me kind of shift gears here and talk about my -- a recent interest of
mine, which is related to volunteer computing in the sense that we're trying to use
the public to help scientific research. The idea here is to use the people
themselves and their intelligence, their cognitive abilities, their knowledge, rather
than their computing power.
So a couple of years ago I got involved in a project at the Space Sciences Lab where I
work, called Stardust@home, that had to do with finding particles of interstellar
dust in a chunk of aerogel, this very odd material, that had been sent into space,
had collected a certain amount of cometary and interstellar dust, and was
parachuted back to earth. The problem was that nobody knew where the dust
particles were or how many there were, and they didn't even really know exactly
what they were going to look like. We thought
about attacking this problem with computer vision, image processing, and it didn't
work.
So instead we set up a system where we would train people on what we thought
these dust tracks would look like. They would look at microphotographs of this
aerogel -- actually not just single photographs, but a stack of
photographs at different focal planes. There's a focus knob you can turn,
and what you expect to see is a little tunnel that goes through the aerogel
and a little cave down at the end. You can't actually see the dust particle, but
you can sort of tell where it is. And this was wildly successful.
We got 23,000 volunteers who on average looked at 1,600 of these focus movies;
each one takes maybe 30 or 60 seconds to look at. So they contributed a lot of
time to that, and there was a lot of enthusiasm. And we were able to calibrate exactly
how well they were doing by introducing a certain random fraction of jobs that we
had synthetically created: either images that we knew didn't have a dust
particle, or images where we had Photoshopped in a dust particle with a certain
size and orientation. So 20% of the jobs were these calibration jobs, and we were
able to really quantify how well the volunteers were performing, which in the end
turned out to be much better than individual graduate students could do.
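A sketch of how calibration jobs let you score a volunteer; the structures here are hypothetical, and the actual Stardust@home scoring was more involved:

```cpp
#include <cstdio>
#include <vector>

// One calibration job shown to a volunteer: we know the right answer in advance.
struct CalibrationResult {
    bool truth;     // does the (possibly Photoshopped) image contain a track?
    bool answer;    // what the volunteer said
};

// Estimate a volunteer's sensitivity (hit rate) and specificity from the
// roughly 20% of jobs that were secretly calibration images.
void score_volunteer(const std::vector<CalibrationResult>& results) {
    int hits = 0, positives = 0, correct_rejects = 0, negatives = 0;
    for (const auto& r : results) {
        if (r.truth) { positives++; if (r.answer)  hits++; }
        else         { negatives++; if (!r.answer) correct_rejects++; }
    }
    printf("sensitivity = %.2f, specificity = %.2f\n",
           positives ? double(hits) / positives : 0.0,
           negatives ? double(correct_rejects) / negatives : 0.0);
}

int main() {
    std::vector<CalibrationResult> r = {
        {true, true}, {true, false}, {false, false}, {false, false},
    };
    score_volunteer(r);   // a volunteer who missed one planted track
    return 0;
}
```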
And there have been a few other of these projects, which my word for it is
distributed thinking. Some people call it crowd sourcing. There's a project called
Galaxy Zoo, where people look at deep space images and identify different types
of galaxies. One interesting one from the University of Washington is called
Foldit, and the goal here is to take a complex protein molecule and sort
of fiddle around with it to reduce its potential energy. So this is what it looks like.
It kind of shows you the places in the molecule where there is a space
that needs to be filled in -- those are the yellow balloons -- or where atoms are too
close together -- those are the red balloons. You can sort of grab pieces and try to
reduce the energy. And Foldit is deployed as a multi-player
online game so you can see a bunch of other people who are fiddling with the
same molecule at the same time and sort of race with them to try and get the
energy down.
This task of finding the low energy state of complex molecules
turns out to be one of these things where humans, some humans, can do better
than computers. So I started this project about six months ago -- I guess I like
to write middleware, so I decided to write a middleware platform for distributed
thinking, to make it easy for scientists to make new projects like Stardust@home
and Foldit.
And probably the central part of my service, the system I'm working on, is support
for learning about your volunteer population. You know, basically figuring out
who the savants are. There's going to be some fraction of volunteers who are so
bad at the task that the best you can do is to ignore their contributions, and
probably a few people who actually try to undermine your experiment.
And the other thing I'm working on these days is a platform to make it easy to
teach people in the context of distributed thinking and Volunteer Computing.
Both of these areas have the property that they have this giant pool of
volunteers, hundreds of thousands or millions of people from all over the world,
all different ages, education levels, interests, backgrounds and so forth, so
extremely diverse. And you have more or less a steady flux of new volunteers,
typically several hundred or maybe a thousand new volunteers every day. And
both for distributed thinking and volunteer computing, it's useful to try and teach
these people something. Either teach them about the science that you're doing
to get them more involved or in the case of distributed thinking to train them to do
one of these applications that could actually require a lot of knowledge, like
fiddling with proteins.
So you have this great diversity of the student population, and because
you have a steady flux of students there is the opportunity to actually create
experiments. So let's say you have two alternative lessons that teach a given
concept; you could set up an experiment that runs these two lessons side by side,
with an exercise after that. And at the end of a day or two you would have
collected enough data to give you some statistically significant information about
which one of the lessons is better or maybe one of the lessons works well for
some subset of your population, you know, like the older males or something like
that.
So that's the basic idea of this other system I'm working on, called Bolt. It makes it
easy to set up experiments and to have training that constantly evolves to teach
more and more effectively, and it can also be adaptive.
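A sketch of the kind of experiment Bolt is meant to support -- random assignment of volunteers to one of two lessons and a comparison of exercise scores. The names and the simulated scores here are made up, not Bolt's actual interfaces:

```cpp
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// Randomly assign each new volunteer to one of two alternative lessons,
// then compare mean exercise scores once enough students have come through.
struct LessonArm {
    std::string name;
    std::vector<double> scores;   // exercise results, e.g. fraction correct
    double mean() const {
        double s = 0;
        for (double x : scores) s += x;
        return scores.empty() ? 0 : s / scores.size();
    }
};

int pick_arm(std::mt19937& rng) {   // 50/50 random assignment
    return std::uniform_int_distribution<int>(0, 1)(rng);
}

int main() {
    std::mt19937 rng(42);
    LessonArm arms[2] = {{"lesson A", {}}, {"lesson B", {}}};

    // Simulate a day's worth of new volunteers taking whichever lesson they
    // were assigned, then the follow-up exercise (made-up score distributions).
    std::normal_distribution<double> exercise_a(0.70, 0.1), exercise_b(0.75, 0.1);
    for (int i = 0; i < 500; ++i) {
        int arm = pick_arm(rng);
        double score = (arm == 0 ? exercise_a : exercise_b)(rng);
        arms[arm].scores.push_back(score);
    }

    printf("%s: mean %.3f (n=%zu)\n%s: mean %.3f (n=%zu)\n",
           arms[0].name.c_str(), arms[0].mean(), arms[0].scores.size(),
           arms[1].name.c_str(), arms[1].mean(), arms[1].scores.size());
    return 0;
}
```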
Okay. So I guess that's -- I've reached the end here. So Volunteer Computing
has done some impressive things in the past year: the PetaFLOP barrier, the
possibility of moving towards an ExaFLOP. But as far as I'm concerned it's really just
barely achieving the tiniest part of its potential, both in terms of reaching a
bigger volunteer population, by one or two orders of magnitude, and bringing in
more scientists.
Volunteer Computing has not been embraced by the high-performance computing
community or even the computer science community. You can go to the Supercomputing
conference and you will not hear one single reference to Volunteer
Computing, unless I happen to be giving a talk there, which is rare. This idea of
distributed thinking, I think, is really interesting, and it is in its extreme infancy.
We need to get scientists thinking about how they could potentially use that and
these two things could actually potentially link together. You can imagine
workflows where distributed thinking processes data in some way that then feeds
into a computing system, possibly a volunteer computing system.
So if anybody is interested in either of these things, here's my e-mail address.
I'm real eager to think about starting new projects or to do research in either one
of these things. So thank you very much and I can answer any questions that
anybody might have.
(applause)
>> Question: I have a question. Are there any concrete results from the projects
actually using this kind of computing? I mean, if I have some computations to make, I
can't support a cluster myself, and I decide to go for Volunteer
Computing. Are there current projects out there right now running on volunteer
computing? Are there significant results, things out there you could (inaudible) --
>> Dr. David Anderson: Well, yeah, a lot of them. Volunteer Computing has an
image problem, which is that most people think of it and equate it
with SETI@home. And they say, SETI hasn't found ET, therefore Volunteer
Computing is scientifically worthless. So to try and combat that I assembled a list
of publications of all the projects that use BOINC, results that were possible only
because of Volunteer Computing. And there is a good number of them. There
are probably four or five papers in Nature and Science and PNAS, and a total of 50 or
so papers. Folding@home has a huge number of papers, all of which were
enabled by Volunteer Computing.
So there hasn't been the one big discovery, you know. Nobody has discovered
the cure for cancer or extra terrestrial life so far, but there is a steady stream of
kind of standard-sized scientific results. It's the same as any other computing
resource, it's just cycles.
>> Question: Has anyone done anything with Flash or one of the other kind of
applications with a Web Browser where people don't install anything, it's just
while they're sitting at this web page -- computers doing something useful?
>> Dr. David Anderson: Yeah, there have been a few efforts in that direction.
There's really only one, using Java applets. Like I say, most of the projects
that use BOINC have some existing application which, if you're lucky, is in
C++. More often it's in Fortran, so pragmatically that's sort of a limited
approach. Yeah?
>> Question: (Inaudible) sort of the scientist says like (inaudible) do they have
to adapt to (inaudible) or there two kinds of libraries that will (inaudible)
framework that they use or how do they (inaudible)?
>> Dr. David Anderson: There are different options depending on how much work
you want to do. The easiest option is to use what's called the wrapper, where you
don't have to change your program at all. In fact, you don't even have
to have source code for it. It runs kind of inside this wrapper that manages
the communication with the BOINC client.
For a real application typically you have to do checkpointing, and
BOINC has a very small set of APIs that essentially communicate with the
client to tell the app when it should checkpoint and to acknowledge when the
checkpoint is finished.
If you want to do graphics, you know, to have your app show something
in the screensaver, there are some APIs for that. But they're all very easy to
use, and there are Fortran as well as C and C++ bindings. You
can also run Java applications under BOINC; some people are doing that.
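For the checkpointing part, the core of a BOINC-enabled application looks roughly like this sketch. The boinc_* calls (boinc_init, boinc_time_to_checkpoint, boinc_checkpoint_completed, boinc_fraction_done, boinc_finish) are the real API; the work loop and state-saving functions are made up for illustration:

```cpp
// Sketch of a BOINC-enabled main loop. The boinc_* calls come from the BOINC
// API (boinc_api.h); do_one_step, load_state and save_state are hypothetical.
#include "boinc_api.h"

const int TOTAL_STEPS = 1000000;

void do_one_step(int step) { /* the application's actual science goes here */ }
int  load_state()          { return 0; /* read last checkpointed step, if any */ }
void save_state(int step)  { /* write enough state to resume from this step */ }

int main(int argc, char** argv) {
    boinc_init();                               // attach to the BOINC client

    int step = load_state();                    // resume from last checkpoint
    for (; step < TOTAL_STEPS; ++step) {
        do_one_step(step);

        if (boinc_time_to_checkpoint()) {       // client says: good time to save
            save_state(step + 1);
            boinc_checkpoint_completed();       // tell the client we're done saving
        }
        boinc_fraction_done(double(step) / TOTAL_STEPS);  // progress for the GUI
    }

    boinc_finish(0);                            // report success and exit
    return 0;
}
```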
>> Question: (Inaudible) -- presence like how often like the growth rate this
year? Like SETI@home has been used I call (inaudible) to use it as (inaudible)
project like how much the user actually (inaudible)?
>> Dr. David Anderson: Yeah, we have -- we've collected -- there's a bunch of
different usage data that one can imagine. There's the churn rate, the
distribution of time that people actually participate. We studied that a little bit.
There's a curve which is pretty much what you would expect. We've studied the
availability, the sort of -- some people leave their computers on 24 hours a day
and BOINC is able to compute the whole time. Other people have BOINC set up
to compute only when they're not using the computer so there's these different
pieces of time. We've instrumented the BOINC client to collect all that data, log it,
and send it back to the server. We did that with
SETI@home I think about six months ago, and I generated usage data for about
100,000 computers and did some analysis of that. If you're interested in that I'd
be happy to supply any of that data.
As far as the fraction of malicious or malfunctioning computers, we could
probably reconstruct that, but we haven't done that so far.
>> Question: I mean, a rough approximation of how much cheating is going on,
order of magnitude?
>> Dr. David Anderson: Order of magnitude is probably like two or three
people out of a million. It's really small, but if the others learn that people are
successfully cheating they get demoralized and it becomes known quickly. We
added a lot of functionality in BOINC to detect and defeat cheating early on. The
pre-BOINC version of SETI@home did not have any of those checks and there
was rampant cheating, which we had to deal with after the fact.
>> Question: So with the Bossa project, have you thought about, or
are you just (inaudible) -- it sounds like it's exclusively for a certain website. You go
to some website and you can evaluate the photograph or whatever. Have you
thought about what it would mean to host chat in a separate computation making
that framework open so that if you got a cell or whatever that someone has
already got a bunch of A to N, you could have them use their big human brain
and submit results over just the framework, like OSUI looks.
>> Dr. David Anderson: I haven't thought of that, but should be pretty easy to
do. If you have any ideas for what -- for a good potential project, let's figure out
how to do that.
>> Question: (Inaudible) possibly use it all having this (inaudible).
>> Dr. David Anderson: Bossa hasn't actually been used for
anything so far. It only started working about a month ago. The two pilot
projects, just so you know, are both image analysis, and they're kind of interesting.
One of them is looking for hominid fossils. So when people like Louis Leakey and
so forth go out and try to find fossils in Africa, they do it by finding areas where
erosion has exposed fossil-bearing earth, and then they just kind of walk around
looking down at the ground really closely, and it's strewn with rocks and pebbles
and every now and then you -- there will be a tooth mixed in there and you can't
see it unless you like stare right at it and these people become extremely good at
recognizing bone fragments. But it's very hard to cover a lot of area that way and
when you do cover it you trample stuff and you potentially break things.
So this guy named Tim White at Berkeley, who's the new Leakey -- we're working on
a project where we'll get photographs of large areas of these fossil-bearing
regions. The original plan was to have an unmanned airplane sort of crisscross
the thing. I think in the end we're going to have a human carried sort of raft that
has a bunch of cameras on a kind of a frame and a GPS device so you walk
along and it automatically clicks. And then we'll have these people look at these
pictures on the web and try to find fossils. That's one.
The other project involves annotating satellite images of Africa. There's this giant
amount of satellite picture data, but there's no information about, like, where
the roads are, where the settlements are, the crops, things like this. So we're setting up
a project where people can annotate those images and then scientists can learn
something from them.
It would be nice to think of a project that didn't involve images. Something that
involved, you know, some sort of AI knowledge kind of thing.
Anything else? Thank you.
(applause)