
>>: All right. Good afternoon. Welcome, everybody, thanks for joining us here. I
am not Dan Fay. Dan is the host of today's lecture and he had to run away for a
family situation, not an emergency, but had to take care of some family business.
My name is Alex Wade. I am here with External Research and it is my pleasure
to welcome today David Anderson.
David received his PhD in Computer Science from the University of Wisconsin in
He's taught in the computer science department at U.C. Berkeley, worked at
several startups, and then has been back at U.C. Berkeley as a research
scientist since then. His work focuses on citizen cyber science,
using the internet to involve the global public in scientific research. And he's
currently lead of the BOINC Project, the Berkeley Open Infrastructure for
Network Computing, which develops widely used middleware for volunteer
computing. And he's also involved in creating new technology for distributed
thinking and web-based education.
I'd like you to join me all in welcoming David Anderson to Microsoft Research
today.
(applause)
>> Dr. David Anderson: Thanks very much, Alex. So today I'm going to talk
about the field of Volunteer Computing, where we're at today, where we're going
in the next couple of years. And at the end I'll say a couple words about my new
interest in involving people themselves, rather than their computers, to do
science.
This slide shows the history of volunteer computing. Volunteer computing is a
form of scientific computing where the resources, the processing power, are
volunteered by computer owners. And the projects that got this started in 1996
were kind of proof-of-concept things, looking for prime numbers and breaking
cryptosystems.
SETI@home and Folding@home were the first ones that were doing kind of new
science or real science and they also added graphics to show people what was
going on in their computer and that seemed to make a big difference. They got
really big.
These early projects all developed their own infrastructure software, the software
that manages the distribution of jobs and acts like a screen saver and things like
that. And that turned out to actually be a lot of work, so a bunch of groups came
up with the idea of making platforms to facilitate this sort of thing.
The first several of these were commercial. People tried to figure out how to
make money off of volunteer computing. None of them succeeded. In 2002, I
started to work on the BOINC Project, which was an open-source middleware
platform for volunteer computing and that led to a proliferation of volunteer
computing projects starting in about 2004.
So currently there are maybe 40 or 50 fairly large-scale volunteer computing
projects. It's still kind of dominated by SETI@home and Folding@home, but there
are a bunch of other fairly large ones these days.
These -- the applications that these projects do and the kind of science that
they're doing runs the gamut across pretty much all areas of computational
science. A lot of them are involved with computational biology, things involving
protein folding, which is sort of simulating how proteins develop out of gene
sequences or virtual drug design, which is figuring out how molecules bind with
proteins. A lot of this has applications to very practical problems like developing
vaccines or drugs for human diseases.
There are projects studying for example how malaria spreads, improving models
of the spread of diseases to figure out how we can spend prevention dollars
optimally. Some Volunteer Computing projects -- well, there's one from
CERN that has run simulations of the Large Hadron Collider, both the
accelerator part of it and the collisions in the detectors. There are some
projects that study climate change and global warming. A number of projects
involve different parts of astronomy: SETI, and Einstein@home, which is looking for
gravitational waves using the new LIGO detector.
Some projects do mathematics, various brute-force searches. Volunteer computing can also be
used not necessarily to do large-scale computing, but just as a way of deploying
a program on a lot of computers. So there's an interesting project called the
Quake-Catcher Network from Stanford, which uses the accelerometers in laptop
computers as pieces of a distributed seismograph, and you can actually detect
earthquakes earlier than you could otherwise because of the proximity of
computers to the earthquake.
And I should point out that almost all these applications were not developed for
volunteer computing. They were programs that the scientists were already
using and in some cases they had to do some work to get them to run on
consumer platforms. So, for example, the climate study projects had to take
these climate models, which are huge multi-million-line Fortran programs that had
previously only run on super computers and try to get them to compile for
Windows and that sometimes takes a certain amount of work.
Also, all of the applications are what are called bag-of-tasks applications. They
involve a bunch of independent jobs that don't communicate with each other.
There is also a possibility of running MPI type programs on a restricted set of
computers. Most of the applications are more compute intensive than data
intensive, though as consumer networks become faster that limitation is
disappearing.
So Volunteer Computing is a way to get a lot of computing cycles. The units
that people talk about computing power in these days are TeraFLOPS and
PetaFLOPS, which are thousands of TeraFLOPS. The first
computation to exceed the PetaFLOP barrier was Folding@home, which is a
volunteer computing project. That was reached last fall and it happened many
months before the first super computer achieved a PetaFLOP throughput.
As for the union of projects that use BOINC -- I should say that Folding@home is one of
the early projects that developed their own infrastructure software -- the totality of
BOINC-based projects exceeded the PetaFLOP barrier early this year, and it's
currently averaging 1.2 PetaFLOPS and I should probably make a couple of
comments here. The Folding@home computing power, the bulk of it currently
actually comes from Sony PlayStation 3s, the cell processor in that does about
100 GigaFLOPS so they get a lot of power from that. And recently they also
have developed a version of their app that runs on NVIDIA GPUs, and with a fairly
small number of computers that is now providing 40% of their total power.
BOINC is still almost entirely CPU based. There are about 570,000 computers
running BOINC around the world. The majority of those are Windows so the vast
majority of scientific computing done on Windows is done using volunteer
computing right now.
So the -- in addition to being a source of a lot of FLOPS, Volunteer Computing is
a really good deal. If you work out the expense of getting one TeraFLOP per
second of computing power over an entire year and you look at different
approaches the approximate numbers are like this. If you build a cluster yourself
and you buy the hardware and the networking equipment and you pay the
electricity bills and you pay system admins to keep the system running, it ends
up being about $124,000 per TeraFLOP-year.
Cloud computing is another way to get cycles. On the Amazon Elastic Compute
Cloud, that TeraFLOP-year would cost you about one and three quarter million dollars. If
you look at the 10 largest BOINC projects, their expenses are really dominated by
hiring a sysadmin to keep the server running. A BOINC project has a server,
which is typically a Linux box or a few Linux boxes. The hardware costs are
almost nothing, and the cost per TeraFLOP-year is about $2,000.
>> Question: How much electricity are the (inaudible) --
>> Dr. David Anderson: That's a difficult question to answer. In many cases
the computation's done while the computer is on anyway. Yeah. It will use about
30 more watts if it's doing a floating point intensive computation.
There's no doubt that a large amount of that electricity cost, which is lumped into
the cluster, is -- somebody's paying it, it's just not the scientists. It's being spread
out among a lot of computer owners.
So the real goals of Volunteer Computing, it's not to set records or to break
barriers, it's to facilitate new science and to in particular to allow science to be
done that wouldn't happen otherwise because people couldn't afford it. If you
look at the top 500 super computers, historically most of the top ones are owned
by defense labs and are used for nuclear weapons research.
Volunteer Computing is a way to get a lot of cycles if you're working in an area
that's underfunded, like you are doing SETI, or you are studying diseases like
malaria that don't have a lot of money behind them. Or if you are a scientist
working in a country that doesn't have much computing infrastructure or you're
doing science that's speculative and unpopular in the current political
environment of your area.
Volunteer Computing gives power to scientists who can explain their research to
the general public and convince the public that it's worth this investment in their
electrical bill to help the scientists. And conversely what we're trying to do is to
create an environment where computer owners have a wide choice of things that
they can participate in and they will try to make an informed decision by actually
going out and learning about the science that these various projects are doing
and that that process will increase their awareness of scientific research and their
interest in it.
So that's the idea. I wouldn't say we've made really good progress. The
numbers that we have right now, there's roughly a half million people
participating in Volunteer Computing, about a million computers, but that's only
maybe a tenth of a percent of the one billion or so internet-connected PCs right
now. And on the science side 50 or so projects is a tiny fraction of the scientists
who could potentially benefit from Volunteer Computing.
So let me say a little bit about the -- about reaching the ExaFLOP barrier, which
is a thousand PetaFLOPS. And this is something that if you think about building
an ExaFLOP super computer in the traditional sense of having a bunch of
hardware in a room, it's not really feasible, economically, for maybe 10 or 15
years; it's going to take too much power and too much money. But we could
potentially reach the ExaFLOP barrier in Volunteer Computing a lot sooner than
that. To start to think about this we need to consider not just CPUs, but other
kinds of computing devices. And I'm going to go through a few of these and talk
about their potential to supply computing power.
So for each of these different kinds of devices we need to think about its
performance and how fast the performance is going to grow over time, how easy
it is to program for scientists, the energy-efficiency issues, which will become
more important, and there are also pragmatic issues of deploying volunteer
computing on these different kinds of devices, some of which are controlled by
single companies.
So CPUs will continue to be an important part of the cycle pool for a while. Their
contribution will come more and more from multiple cores, and the tradeoff between
having an application that uses a lot of cores versus trying to run a bunch of separate
jobs on individual cores means that running parallel apps will become increasingly
important for using this power.
In addition, there are a lot of mechanisms being developed to save electricity, which
means computers will shut down or go into low power modes, and their availability,
the fraction of time that you can do computing on them, is going to shrink. I
mean, the model is definitely moving toward computing happening while the user's at the
computer, in some zero-priority mode that stays out of their way, as opposed to
the old model where when your screensaver kicks in, that's when the computing
starts.
Moving from the current million or so PCs to tens of millions of PCs is going
to take the help of somebody like Microsoft or a computer manufacturer or a
media company. Basically we currently have the market of computer enthusiasts
pretty much saturated with Volunteer Computing, but reaching sort of the
average person who uses the computer as an appliance, that's the hard part.
The next type of device, which is actually I think the most interesting, is GPUs.
This picture shows why GPUs have a performance advantage. They don't have
to devote a whole lot of transistors to caching and creating the illusion of random
access main memory running at processor speed.
So for example, the current NVIDIA chip does about 500 GigaFLOPS, half a
TeraFLOP; that's maybe 50 or 100 times faster than the CPU on a typical PC
that it's in. Programming GPUs to do scientific programming has become easier
recently with the introduction of CUDA, NVIDIA's C-based environment for GPU
programming, and Apple has announced something called OpenCL that seems
to have the same goals.
So we could get an ExaFLOP, 10 to the 18th floating point operations per second, if
we had four million GPUs, each doing a TeraFLOP, which is going to be the case a year
or two from now, and 25% availability, running a quarter of the time. That's an
ExaFLOP, and it could conceivably happen in three or four years.
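A rough sketch of that arithmetic, using just the figures quoted above:

```cpp
#include <cstdio>

int main() {
    // Back-of-the-envelope check of the ExaFLOP estimate from the talk:
    // 4 million GPUs, each sustaining ~1 TeraFLOP/s, available 25% of the time.
    const double num_gpus      = 4e6;    // hosts with a capable GPU
    const double flops_per_gpu = 1e12;   // ~1 TeraFLOP/s per GPU
    const double availability  = 0.25;   // computing a quarter of the time

    double sustained = num_gpus * flops_per_gpu * availability;
    printf("Sustained throughput: %.2e FLOP/s (1 ExaFLOP = 1e18)\n", sustained);
    return 0;
}
```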
Another type of resource is video game consoles, which are becoming faster as
people want to have more realistic games and so forth. These could potentially
provide nontrivial computing power, but they're inconvenient for a couple
reasons. They can be hard to program, in the case of the Sony PlayStation 3.
And in general they tend to have closed environments that are hard to get your
program deployed on. So my crude calculation is that they could potentially give
us a quarter of an ExaFLOP in a few years.
Another kind of device that people have been thinking about recently because of
energy efficiency issues is mobile devices, things like cell phones and PDAs,
media players, the Amazon Kindle. Currently these are sort of discrete
categories, but internally they're converging to the same sort of hardware and the
processors in these are designed for low energy consumption. So in terms of
FLOPS per watt, these are the best thing going.
I should say we are thinking about -- one can consider using these while they're
recharging. You wouldn't want to have your battery being zapped by running a
scientific program while the thing is in your pocket. The software environment is
problematic. Things like Google's Android, the proposed Open Source
environment for cell phones, is a possibility, though currently that requires doing
everything in Java.
And there are a lot of these: there are currently over three billion cell phones and it will go
up to five billion pretty soon. But they're so slow that actually even if you got all
of them it's a lot less computing than you could get out of GPUs. So it's an
important idea, but I think that GPUs are more important.
Similarly, appliances like home media players are moving to a full-featured PC
inside the box. Cable set-top boxes and Blu-ray players are typical of this
category. The software environment is converging to a Java based platform. So
again we have the problem of not being able to compile your Fortran program for
it. So they could potentially give us a fraction of an ExaFLOP.
Okay. So that's the summary of why I think Volunteer Computing is the quickest
path to reaching the ExaFLOP milestone.
Let me talk a little bit about the BOINC project, which I lead at Berkeley, which
essentially provides an operating system for doing volunteer computing. We run
a little tiny project, me and one and a half other programmers. Most of what we
do is developing technology. We write software. And to fill a vacuum we also
try to enable a variety of online communities related to volunteer computing.
So we run a bunch of e-mail lists and message boards for people who do things
like writing translations or providing customer service, customer support for
users, doing testing. The task of testing our software on all the popular platforms
in the world is way more than we can do ourselves. We have a lot of volunteers
to do it.
Let me just kind of quickly describe what BOINC is, what the software itself is,
and some of our development efforts these days. There's two halves to the
BOINC software, the server part and the client part. The server part consists of a
job processing mechanism and the key component here is the scheduler. It's
basically a batch queuing system that has to have extremely high capacity. Many
of these projects have hundreds of thousands of clients and they need to be able
to handle hundreds of requests per second and issue them a single job or maybe
several jobs.
So, out of a database that at a given point may have a million or so jobs in it, the
key aspect of the architecture is that the schedulers are insulated from the
database. There's a cache of jobs in shared memory, which is replenished by
a separate program, and the schedulers, instead of having to go to the database
and find jobs that are appropriate for that particular client, can just look in the
cache, and that works really well. The BOINC scheduler is able to dispatch on
the order of 10 million jobs a day, which is very important.
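A minimal sketch of that feeder/scheduler split, with hypothetical names and none of the real BOINC machinery; the point is just that schedulers take jobs from an in-memory cache that a separate program keeps full, instead of scanning the database:

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

// Hypothetical job record; the real BOINC workunit/result tables carry much more.
struct Job {
    long        id;
    std::string app_name;
    double      est_flops;   // estimated FLOPs to complete
};

// In-memory cache standing in for the shared-memory segment the feeder fills.
class JobCache {
public:
    // Feeder side: periodically refill the cache from the database,
    // so schedulers never have to scan the database themselves.
    void refill(const std::vector<Job>& from_db) {
        std::lock_guard<std::mutex> lock(m_);
        for (const auto& j : from_db) jobs_.push_back(j);
    }

    // Scheduler side: handle a client request by taking a job from the cache.
    std::optional<Job> dispatch_one(const std::string& client_app) {
        std::lock_guard<std::mutex> lock(m_);
        for (auto it = jobs_.begin(); it != jobs_.end(); ++it) {
            if (it->app_name == client_app) {   // crude suitability check
                Job j = *it;
                jobs_.erase(it);
                return j;
            }
        }
        return std::nullopt;  // nothing cached for this app; feeder will refill
    }

private:
    std::mutex m_;
    std::deque<Job> jobs_;
};

int main() {
    JobCache cache;
    cache.refill({{1, "einstein", 1e13}, {2, "climate", 5e14}});
    auto job = cache.dispatch_one("climate");   // fast path: no database query
    return job ? 0 : 1;
}
```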
The other part of the BOINC server software is a whole bunch of PHP code that
provides a website for the Volunteer Computing project. And this is real
important because one of the things that brings people in to volunteer their
computer and that keeps them going year after year are community -- well,
competition mechanisms and community features of various sorts. So people
like to keep track of how much work their computer has done, compete with other
people, form teams, compete among teams, talk with other people about the
science that's going on and all sorts of things. And these are really critical
functions.
We've tried to make it really easy to set up a computing project using BOINC so
you can either set up a Linux box and install the BOINC software on it and port
your application to our API, which is very easy. And in a day or so you can have
a project up and running.
Even easier than that, we've created a VMware virtual machine image that has
all the BOINC software already there and everything that it depends on, like the
proper versions of MySQL and PHP and so forth, and you can run that virtual
machine on any computer you want and get things going even faster.
If you want to avoid even worrying about hardware -- generally the server that
you run this stuff on, you want it to be highly available and scalable and so forth --
we're working on developing a virtual machine image for the Amazon Elastic
Compute Cloud so that you won't even have to worry about hardware anymore.
So we're trying to really reduce the barriers to entry for Volunteer Computing as
low as we possibly can.
Client software -- well, first of all it runs on all popular computing platforms. It's
really designed so that a nontechnical computer owner can install it with one
click, not do any configuration at all, and have it work indefinitely with no
intervention from the user.
There's also of course a big category of users who want to configure things and
want to have a lot of knobs to turn and so forth. The actual internal structure of
the software is a bit involved. It looks to the user like a single entity, but it's
made up of several programs; this picture shows what they are.
The central program is what we call the core client. It's in charge of doing all
network communications, talking to servers, getting jobs, downloading files. It's
in charge of doing CPU scheduling, deciding when to run applications.
The different GUI pieces -- well, there's what we call the BOINC Manager, which
shows you kind of a spreadsheet-style picture of what's going on, and if you want
there is also a screen saver; both of these can run application graphics. So an
application actually consists of two parts, the part
that does the scientific computing and if it wants it can have a separate piece that
does graphics and talks through shared memory so you can see the current state
of the computation.
As for the interfaces between these, the GUI controls the core client
through RPCs over a TCP connection, so in fact you can use the
GUI to control clients on remote hosts. Now the bigger picture: if you
install the BOINC client software on a PC initially it doesn't do anything. You
have to then do what we call attach the client to whatever set of projects you
want.
So like I say, there's on the order of like 50 projects out there and we want
people to go out and read their websites, learn the science that they're doing,
decide which ones they think are important and then you can attach your
computer to any subset of the projects. An attachment has an associated
weight, which says how much of your resources you want to go to the different
projects. So you could spend 80% of your time studying the climate and 20%
doing some sort of biomedical stuff or whatever you want.
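A minimal sketch of how those attachment weights might translate into a split of processing time; the project URLs and structures here are made up, and the real client's scheduling is considerably more elaborate:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical attachment record: a project URL and the volunteer's chosen weight.
struct Attachment {
    std::string project_url;
    double      resource_share;   // e.g. 80 and 20
};

int main() {
    std::vector<Attachment> attachments = {
        {"http://climate.example.org/", 80.0},
        {"http://biomed.example.org/",  20.0},
    };

    // Normalize the shares: each project gets a fraction of this computer's
    // time proportional to its weight.
    double total = 0;
    for (const auto& a : attachments) total += a.resource_share;

    for (const auto& a : attachments) {
        printf("%s gets %.0f%% of this computer's time\n",
               a.project_url.c_str(), 100.0 * a.resource_share / total);
    }
    return 0;
}
```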
The key idea is that these projects are completely independent of each other.
There's no central BOINC authority. There's not even an official listing of all
these projects anywhere. Each one has its own server. They're identified simply
by the URL of their website.
Now the goal of this model is to promote kind of an ecosystem where new projects
are constantly arising and disappearing and volunteers are constantly learning
about new projects and assessing their priorities. In practice this is kind of
difficult because the only way that people have for finding out about new projects
is by Googling or word of mouth or something like that.
We've developed a framework that supports intermediate websites called account
managers, so the idea is that an account manager provides sort of one-stop
shopping for Volunteer Computing. It's a website where you can go and look at
all the available projects, summarized in some way and their research described.
And you can attach to them just by clicking checkboxes, as opposed to having to go out and survey a whole bunch of separate websites.
To make this possible, the BOINC server software provides a set of web services
so that account managers can create accounts on the different projects and look
up stuff, manipulate stuff. There are currently two of these. One is called
GridRepublic, the other is called the BOINC Account Manager.
Some of the sort of technical work we've been doing recently -- it's become
clear that supporting GPUs is very important, and also that supporting multi-threaded
applications that can use multiple cores is important. We've had to kind of
revamp the internal architecture of BOINC. Originally the idea was that a client
talks to a server and it tells the server what its platform is or it can actually give it
a list of platforms so it may be able to run both Win 64 and Win 32 applications.
The server has a bunch of jobs. Each job is associated with an application, not
with a platform or version or anything like that. The server goes through and
finds the -- whatever it thinks that the best platform would be and sends the client
that particular application.
Well, with the advent of multi threaded and coprocessor applications we needed
to generalize this so in the current architecture a given platform, like Win 32, may
have a whole set of different application versions that run in that platform. So
there could be one which is optimized for sequential processing, one that is a
multi-core version, one that uses a CUDA GPU, maybe another one that uses both
multi-core and CUDA.
Now the problem, of course, is figuring out for a given client which of these
different alternate application versions is best. And instead of embodying that
intelligence in BOINC itself, we came up with an architecture where the project
supplies a sort of plug-in function that goes into the scheduler, which takes as
input a description of that particular host -- its CPUs, its co-processors,
everything else about its hardware -- and goes through these different available
versions and for each one decides how many cores will that application be able
to use in that machine? You know, how many co-processor instances? How
many flops does it expect the application to get running in that machine?
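A sketch of what such a project-supplied plug-in might look like; the type and field names here are hypothetical, not the actual BOINC scheduler interfaces:

```cpp
// Hypothetical descriptions of a host and of one application version.
#include <string>

struct HostInfo {
    int    cpu_cores;
    int    cuda_gpus;
    double cpu_flops_per_core;   // measured by benchmarks
    double gpu_flops;            // peak of the GPU, 0 if none
};

struct AppVersion {
    std::string plan_class;      // e.g. "sequential", "multicore", "cuda"
};

// What the scheduler needs back for each candidate version:
// how much of the host it can use, and the expected throughput.
struct VersionUsage {
    bool   usable;
    double cores_used;
    double gpus_used;
    double expected_flops;
};

// Project-supplied function: given the host and a version, estimate its usage.
VersionUsage score_version(const HostInfo& host, const AppVersion& av) {
    VersionUsage u{false, 0, 0, 0};
    if (av.plan_class == "cuda") {
        if (host.cuda_gpus == 0) return u;           // can't run here
        u = {true, 0.5, 1, 0.2 * host.gpu_flops};    // assume ~20% of GPU peak
    } else if (av.plan_class == "multicore") {
        u = {true, double(host.cpu_cores), 0,
             0.9 * host.cpu_cores * host.cpu_flops_per_core};
    } else {                                         // plain sequential version
        u = {true, 1, 0, host.cpu_flops_per_core};
    }
    return u;
}

int main() {
    HostInfo host{8, 1, 4e9, 5e11};   // 8 cores, one ~500 GigaFLOPS GPU
    AppVersion cuda{"cuda"};
    VersionUsage u = score_version(host, cuda);
    // The scheduler would call this for every version of the app, pick the one
    // with the highest expected_flops, and use that figure to estimate runtime.
    return u.usable ? 0 : 1;
}
```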
And then we can use that information first of all to pick whatever the best version
is. And secondly, to get an estimate for how long a job is going to take to run
there, which is critical to make a good scheduling decision.
Just kind of jumping around here, another thing we've been working on here is
improving our replication algorithms. So one interesting thing about Volunteer
Computing is you're using computers that are essentially anonymous and you
can't trust them. You can run a program on them, you get back an answer. You
can't even be sure that what you get back is a result of running your program.
There are a few bad apples, who will intentionally send you back wrong stuff or
output of a previous run of the program or something like that.
In addition, a lot of -- when you're dealing with this number of computers a lot of
them have hardware problems, especially those that are overclocked. They can
get floating point errors that don't crash the system, but they give you bad
answers for the scientific program.
So one way to deal with this is to do replicated computing: take the same job,
run it on two different computers, compare the answers and accept them only if they're
the same. It's actually trickier than that, because different computers
don't do floating point math the same way, and if you run the same program on an
Intel and an AMD processor you can get back wildly different answers, especially
for unstable computations.
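One common way to handle that is to compare results with a tolerance rather than bit-for-bit. This is a sketch that assumes results are just vectors of doubles, which is not how any particular project stores them; for truly unstable computations even this isn't enough, and a project may instead only compare results from numerically identical platforms.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Compare two replicas' outputs with a relative tolerance instead of exact
// equality, so legitimate Intel-vs-AMD floating point differences don't
// cause valid results to be rejected.
bool results_equivalent(const std::vector<double>& a,
                        const std::vector<double>& b,
                        double rel_tol = 1e-5) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i) {
        double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) > rel_tol * std::max(scale, 1.0))
            return false;
    }
    return true;
}

int main() {
    std::vector<double> a{1.0000001, 2.0}, b{1.0000002, 2.0};
    return results_equivalent(a, b) ? 0 : 1;   // accepted despite tiny drift
}
```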
Anyway, so replication -- we've worked it out and it's a way to increase the trust in
your results to whatever level you want, but it wastes computing power. So one
thing we're working on right now is a more intelligent system where we do replication
only some of the time, and if we're sending jobs to a host that has built up trust,
then with a certain probability we don't replicate that particular job.
So the policy right now maintains an estimate of the error rate for each host. If
we're sending a job to a host whose error rate is above a threshold we always
replicate; otherwise we replicate with a probability proportional to the error rate.
So the idea is that the host earns a certain good reputation and after that we
don't always replicate, but every now and then we mix in replication. It's a little
bit unclear whether there are counter strategies that would defeat this scheme
and if anybody is into this kind of stuff, I would enjoy talking to you.
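A sketch of that decision rule, with hypothetical names; the real policy has more state behind it:

```cpp
#include <cstdio>
#include <random>

// Per-host state the server might maintain for adaptive replication.
struct HostStats {
    double error_rate;   // estimated fraction of bad results from this host
};

// Decide whether a job sent to this host should also be replicated elsewhere.
// Above the error threshold we always replicate; below it, we replicate with
// probability proportional to the estimated error rate, so even trusted hosts
// are still spot-checked occasionally.
bool should_replicate(const HostStats& host, double threshold, std::mt19937& rng) {
    if (host.error_rate >= threshold) return true;
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double p = host.error_rate / threshold;   // scales to [0,1) below threshold
    return unif(rng) < p;
}

int main() {
    std::mt19937 rng(1);
    HostStats trusted{0.001}, flaky{0.20};
    std::printf("trusted host: replicate? %d\n", should_replicate(trusted, 0.05, rng));
    std::printf("flaky host:   replicate? %d\n", should_replicate(flaky, 0.05, rng));
    return 0;
}
```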
BOINC involves some very intricate scheduling policies. There's actually two
interacting schedulers in BOINC, one that runs in the client and the other running
in the server. The client has to decide when to get new work and what project to
get it from and how much to ask for and at any given point it has to decide how to
schedule the CPUs among existing jobs.
This can be kind of tricky when you have computers that are disconnected from
the network some of the time. You need to prefetch enough work to keep them
busy while they're not connected. If you fetch too much work you can miss
deadlines, and in the presence of replication you can cause other replicas
to sort of get delayed for a long time until they can be validated.
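A very rough sketch of the work-fetch tradeoff the client faces; the parameters here are hypothetical and this is nothing like the actual client code:

```cpp
#include <cstdio>

// Decide how many seconds of work to ask a project for, given how long the
// computer is typically disconnected and how much work is already queued.
double work_to_request(double buffer_days,          // user's "connect interval"
                       double queued_cpu_seconds,   // work already on hand
                       double cpu_seconds_per_day)  // usable CPU time per day
{
    double target = buffer_days * cpu_seconds_per_day;  // enough to stay busy
    double deficit = target - queued_cpu_seconds;
    // Never over-fetch: too large a queue risks missing deadlines and
    // holding up other replicas of the same jobs on other hosts.
    return deficit > 0 ? deficit : 0;
}

int main() {
    // Example: 2-day buffer, 12 usable CPU-hours/day, 5 hours already queued.
    double req = work_to_request(2.0, 5 * 3600.0, 12 * 3600.0);
    printf("Request %.1f hours of work\n", req / 3600.0);
    return 0;
}
```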
So similarly there are a bunch of server scheduling problems that can be very hard.
Until now we've mostly experimented with scheduling policies by coming up with
something that sounded feasible and deploying it on a running project. The
problem with this is that if you -- if you make mistakes you can waste a lot of
computing time and get a lot of people angry at you. And the other problem is
that there's a lot of factors going on at once and you can't really be sure that the
result you see is because of the change that you made.
So we're developing two simulators, essentially one to study the client and one to
study the server. So for example the server simulator consists of a program that
emulates a huge number, like hundreds of thousands or millions of clients, and it
models things like their error rate, the churn rate, the process of people joining
and leaving the project, the distribution of speeds of the computers and so forth.
And that actually plugs into a real BOINC server; to maximize the accuracy of
the simulation we don't simulate the server, we use a real live server with the
actual programs and the database behind it, and that actually runs fast enough
that you can simulate at about 100 times real time.
We also do a lot of work on the community and competition, the volunteer facing
features of BOINC. And in the past year we've added a bunch of features that
sort of emulate what you might find in a typical social networking website:
the ability for people to make lists of friends and send messages to each other,
and a lot of stuff related to teams. Teams turn out to be a
surprisingly powerful mechanism to motivate people in Volunteer Computing and
we've sort of souped up the idea of teams so that there can be structure within a
team. You can have administrators, kind of lieutenants, as well as the master of
the team, the ability for a team to have its own private message board and things
like that.
We're also figuring out ways to introduce volunteer computing into the big social
networking sites, like having Facebook applications so that
people can see when somebody has joined a new volunteer computing project, or
when somebody's total amount of credit has passed a milestone, and that will show up
on their list of events. And to make it -- yeah?
>> Question: Can you describe how a team knows that it's beaten another
team?
>> Dr. David Anderson: Well, there are elaborate features for listing teams by
either their total credit or their recent average credit. You can filter teams by --
you know, certain countries, or by company teams and university teams.
You can group things in various ways. The other general mechanism is that, in
addition -- we don't really want people to think in terms of individual projects.
We want them to think of their totals across all of the BOINC projects, so we have a
fairly elaborate system where the projects export all of their credit statistics in
XML files, and these third-party websites aggregate that data and show various
forms of competition on leader boards, summed over all the BOINC projects.
So like I say, currently there are about 50 BOINC-based projects. This number is
embarrassingly small. I hoped that there would be a thousand by this point. One
reason for this is that even though we've made it super easy to create a BOINC
project, it's still beyond the capabilities of the average computational scientist.
People who do scientific computing are not computer whizzes, they're not
sysadmins. In many cases they're not actually programmers.
The idea of volunteer computing projects being operated by a single scientist or a
single research group -- I think we've sort of hit the limit of how far that model will
go, because there's not that many research groups that have the resources to do
something like this. There's a bunch of other organizational models. My favorite
is what I call the campus level meta-project.
So the idea -- and this is being deployed at the University of Houston, and hopefully
that will inspire other people to do it -- is a volunteer computing project that is operated
at the level of a university. It handles applications from all of the scientists at that
university. The servers and the website and so forth are operated by sort of a
central group. And it's promoted -- well, first of all, the computers of the university
itself, like the lab machines and so forth, would be configured to run that project to
support the university's own research. It would also be promoted to the university's
students; the University of Houston, for example, has 40,000 undergrad and grad
students, and hopefully a lot of those would want to support university research
on their machines.
It also has 400,000 alumni and this is the interesting number. Alumni have
school spirit that manifests itself in going to football games and sending in money
every year. It seems like it would be pretty straightforward to get these people to
run Volunteer Computing for the university on their computers, most of them own
computers. So I think that is kind of an ideal way to do things and I'm trying to
get some universities interested in that.
There are a few other organizational models. There's a project called
MindModeling.org, which essentially is centered around a particular application.
These are people who build giant Lisp programs that model the human brain and
all of its different components. There's one of these programs called
ACT-R, and there are a lot of researchers that use it, and that community has
come together to create a volunteer computing project that just runs that one
application, and any of those scientists can feed jobs into it.
IBM World Community Grid is another metaproject, or an umbrella project,
operated by IBM more or less as a PR exercise, and it hosts a number of
applications from a variety of universities all around the world. And there are a few
other things going on: in Spain there's a region called Extremadura where
essentially all the universities and research labs have decided to form a volunteer
computing project that spans all of them.
Okay. Let me kind of shift gears here and talk about my -- a recent interest of
mine, which is related to volunteer computing in the sense that we're trying to use
the public to help scientific research. The idea here is to use the people
themselves and their intelligence, their cognitive abilities, their knowledge, rather
than their computing power.
So a couple of years ago I got involved in a project at the Space Sciences Lab where I
work, called Stardust@home, that had to do with finding particles of interstellar
dust in a chunk of aerogel, this very odd material, that had been sent into space,
had collected a certain amount of cometary and interstellar dust, and was
parachuted back to earth. The problem was that nobody knew where the dust
particles were or how many there were, and they didn't even really know exactly
what they were going to look like. We thought
about attacking this problem with computer vision, image processing, and it didn't
work.
So instead we set up a system where we would train people on what we thought
these dust tracks would look like. They would look at microphotographs of this
aerogel -- actually not just single photographs, but a stack of
photographs at different focal planes. There's a focus knob you can turn,
and what you expect to see is a little tunnel that goes through the aerogel
and a little cave down at the end. You can't actually see the dust particle, but
you can sort of tell where it is. And this was wildly successful.
We got 23,000 volunteers who on average looked at 1,600 of these focus movies;
each one takes maybe 30 or 60 seconds to look at. So they contributed a lot of
time to that, and there was a lot of enthusiasm. And we were able to calibrate exactly
how well they were doing by introducing a certain random fraction of jobs that we
had synthetically created: either images that we knew didn't have a dust
particle, or images where we had Photoshopped in a dust particle with a certain
size and orientation. So 20% of the jobs were these calibration jobs, and we were
able to really quantify how well the volunteers were performing, which in the end
turned out to be much better than individual graduate students could do.
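A sketch of how calibration jobs let you score a volunteer; the structures here are hypothetical, and the actual Stardust@home scoring was more involved:

```cpp
#include <cstdio>
#include <vector>

// One calibration job shown to a volunteer: we know the right answer in advance.
struct CalibrationResult {
    bool truth;     // does the (possibly Photoshopped) image contain a track?
    bool answer;    // what the volunteer said
};

// Estimate a volunteer's sensitivity (hit rate) and specificity from the
// roughly 20% of jobs that were secretly calibration images.
void score_volunteer(const std::vector<CalibrationResult>& results) {
    int hits = 0, positives = 0, correct_rejects = 0, negatives = 0;
    for (const auto& r : results) {
        if (r.truth) { positives++; if (r.answer)  hits++; }
        else         { negatives++; if (!r.answer) correct_rejects++; }
    }
    printf("sensitivity = %.2f, specificity = %.2f\n",
           positives ? double(hits) / positives : 0.0,
           negatives ? double(correct_rejects) / negatives : 0.0);
}

int main() {
    std::vector<CalibrationResult> r = {
        {true, true}, {true, false}, {false, false}, {false, false},
    };
    score_volunteer(r);   // a volunteer who missed one planted track
    return 0;
}
```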
And there have been a few other of these projects, which my word for it is
distributed thinking. Some people call it crowd sourcing. There's a project called
Galaxy Zoo, where people look at deep space images and identify different types
of galaxies. One interesting one from the University of Washington is called
Foldit, and the goal here is to take a complex protein molecule and sort
of fiddle around with it to reduce its potential energy. So this is what it looks like.
It kind of shows you the places in the molecule where there is a space
that needs to be filled in -- those are the yellow balloons -- or where atoms are too
close together -- those are the red balloons. You can sort of grab pieces and try to
reduce the energy. And Foldit is deployed as a multi-player
online game so you can see a bunch of other people who are fiddling with the
same molecule at the same time and sort of race with them to try and get the
energy down.
This task of finding the low energy state of complex molecules
turns out to be one of these things where humans, some humans, can do better
than computers. So I started this project about six months ago -- I guess I like
to write middleware, so I decided to write a middleware platform for distributed
thinking, to make it easy for scientists to make new projects like Stardust@home
and Foldit.
And probably the central part of my service, the system I'm working on, is support
for learning about your volunteer population. You know, basically figuring out
who the savants are. There's going to be some fraction of volunteers who are so
bad at the task that the best you can do is to ignore their contributions, and
probably a few people who actually try to undermine your experiment.
And the other thing I'm working on these days is a platform to make it easy to
teach people in the context of distributed thinking and Volunteer Computing.
Both of these areas have the property that they have this giant pool of
volunteers, hundreds of thousands or millions of people from all over the world,
all different ages, education levels, interests, backgrounds and so forth, so
extremely diverse. And you have more or less a steady flux of new volunteers,
typically several hundred or maybe a thousand new volunteers every day. And
both for distributed thinking and volunteer computing, it's useful to try and teach
these people something. Either teach them about the science that you're doing
to get them more involved or in the case of distributed thinking to train them to do
one of these applications that could actually require a lot of knowledge, like
fiddling with proteins.
So you have this great diversity of the student population, and because
you have a steady flux of students there is the opportunity to actually create
experiments. So let's say you have two alternative lessons that teach a given
concept; you could set up an experiment that runs these two lessons side by side,
with an exercise after that. And at the end of a day or two you would have
collected enough data to give you some statistically significant information about
which one of the lessons is better or maybe one of the lessons works well for
some subset of your population, you know, like the older males or something like
that.
So that's the basic idea of this other system I'm working on, called Bolt. It makes it
easy to set up experiments and to have training that constantly evolves to teach
more and more effectively, and it can also be adaptive.
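A sketch of the kind of experiment Bolt is meant to support -- random assignment of volunteers to one of two lessons and a comparison of exercise scores. The names and the simulated scores here are made up, not Bolt's actual interfaces:

```cpp
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// Randomly assign each new volunteer to one of two alternative lessons,
// then compare mean exercise scores once enough students have come through.
struct LessonArm {
    std::string name;
    std::vector<double> scores;   // exercise results, e.g. fraction correct
    double mean() const {
        double s = 0;
        for (double x : scores) s += x;
        return scores.empty() ? 0 : s / scores.size();
    }
};

int pick_arm(std::mt19937& rng) {   // 50/50 random assignment
    return std::uniform_int_distribution<int>(0, 1)(rng);
}

int main() {
    std::mt19937 rng(42);
    LessonArm arms[2] = {{"lesson A", {}}, {"lesson B", {}}};

    // Simulate a day's worth of new volunteers taking whichever lesson they
    // were assigned, then the follow-up exercise (made-up score distributions).
    std::normal_distribution<double> exercise_a(0.70, 0.1), exercise_b(0.75, 0.1);
    for (int i = 0; i < 500; ++i) {
        int arm = pick_arm(rng);
        double score = (arm == 0 ? exercise_a : exercise_b)(rng);
        arms[arm].scores.push_back(score);
    }

    printf("%s: mean %.3f (n=%zu)\n%s: mean %.3f (n=%zu)\n",
           arms[0].name.c_str(), arms[0].mean(), arms[0].scores.size(),
           arms[1].name.c_str(), arms[1].mean(), arms[1].scores.size());
    return 0;
}
```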
Okay. So I guess that's -- I've reached the end here. So Volunteer Computing
has done some impressive things in the past year: the PetaFLOP barrier, the
possibility of moving towards an ExaFLOP. But as far as I'm concerned it's really just
barely achieving the tiniest part of its potential, both in terms of reaching a
bigger volunteer population, by one or two orders of magnitude, and bringing in
more scientists.
Volunteer Computing has not been embraced by the high-performance computing
community or even the computer science community. You can go to the Supercomputing
conference and you will not hear one single reference to Volunteer
Computing, unless I happen to be giving a talk there, which is rare. This idea of
distributed thinking, I think, is really interesting, and it is in its extreme infancy.
We need to get scientists thinking about how they could potentially use that and
these two things could actually potentially link together. You can imagine
workflows where distributed thinking processes data in some way that then feeds
into a computing system, possibly a volunteer computing system.
So if anybody is interested in either of these things, here's my e-mail address.
I'm real eager to think about starting new projects or to do research in either one
of these things. So thank you very much and I can answer any questions that
anybody might have.
(applause)
>> Question: I have a question. Are there any concrete results from the projects
actually using this kind of computing? I mean, if I have some computations to make, I
can't support a cluster myself, and I decide to go for Volunteer
Computing. Are there current projects out there right now running on volunteer
computing? Are there significant results, things out there you could (inaudible) --
>> Dr. David Anderson: Well, yeah, a lot of them. Volunteer Computing has an
image problem, which is that most people think of it and equate it
with SETI@home. And they say, SETI hasn't found ET, therefore Volunteer
Computing is scientifically worthless. So to try and combat that I assembled a list
of publications of all the projects that use BOINC, results that were possible only
because of Volunteer Computing. And there is a good number of them. There
are probably four or five papers in Nature and Science and PNAS, and a total of 50 or
so papers. Folding@home has a huge number of papers, all of which were
enabled by Volunteer Computing.
So there hasn't been the one big discovery, you know. Nobody has discovered
the cure for cancer or extra terrestrial life so far, but there is a steady stream of
kind of standard-sized scientific results. It's the same as any other computing
resource, it's just cycles.
>> Question: Has anyone done anything with Flash or one of the other kind of
applications with a Web Browser where people don't install anything, it's just
while they're sitting at this web page -- computers doing something useful?
>> Dr. David Anderson: Yeah, there have been a few efforts in that direction.
There's really only one, using Java applets. Like I say, most of the projects
that use BOINC have some existing application which, if you're lucky, is in
C++. More often it's in Fortran, so pragmatically that's sort of a limited
approach. Yeah?
>> Question: (Inaudible) sort of the scientist says like (inaudible) do they have
to adapt to (inaudible) or there two kinds of libraries that will (inaudible)
framework that they use or how do they (inaudible)?
>> Dr. David Anderson: There are different options depending on how much work
you want to do. The easiest option is to use what's called the wrapper, where you
don't have to change your program at all. In fact, you don't even have
to have source code for it. It runs kind of inside this wrapper that manages
the communication with the BOINC client.
For a real application typically you have to do checkpointing, and
BOINC has a very small set of APIs that essentially communicate with the
client to tell the app when it should checkpoint and to acknowledge when the
checkpoint is finished.
If you want to do graphics, you know, to have your app show something
in the screensaver, there are some APIs for that. But they're all very easy to
use, and there are Fortran as well as C and C++ bindings. You
can also run Java applications under BOINC; some people are doing that.
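For the checkpointing part, the core of a BOINC-enabled application looks roughly like this sketch. The boinc_* calls (boinc_init, boinc_time_to_checkpoint, boinc_checkpoint_completed, boinc_fraction_done, boinc_finish) are the real API; the work loop and state-saving functions are made up for illustration:

```cpp
// Sketch of a BOINC-enabled main loop. The boinc_* calls come from the BOINC
// API (boinc_api.h); do_one_step, load_state and save_state are hypothetical.
#include "boinc_api.h"

const int TOTAL_STEPS = 1000000;

void do_one_step(int step) { /* the application's actual science goes here */ }
int  load_state()          { return 0; /* read last checkpointed step, if any */ }
void save_state(int step)  { /* write enough state to resume from this step */ }

int main(int argc, char** argv) {
    boinc_init();                               // attach to the BOINC client

    int step = load_state();                    // resume from last checkpoint
    for (; step < TOTAL_STEPS; ++step) {
        do_one_step(step);

        if (boinc_time_to_checkpoint()) {       // client says: good time to save
            save_state(step + 1);
            boinc_checkpoint_completed();       // tell the client we're done saving
        }
        boinc_fraction_done(double(step) / TOTAL_STEPS);  // progress for the GUI
    }

    boinc_finish(0);                            // report success and exit
    return 0;
}
```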
>> Question: (Inaudible) -- presence like how often like the growth rate this
year? Like SETI@home has been used I call (inaudible) to use it as (inaudible)
project like how much the user actually (inaudible)?
>> Dr. David Anderson: Yeah, we have -- we've collected -- there's a bunch of
different usage data that one can imagine. There's the churn rate, the
distribution of time that people actually participate. We studied that a little bit.
There's a curve which is pretty much what you would expect. We've studied the
availability, the sort of -- some people leave their computers on 24 hours a day
and BOINC is able to compute the whole time. Other people have BOINC set up
to compute only when they're not using the computer so there's these different
pieces of time. We've instrumented the BOINC client to collect all that data, log it,
and send it back to the server. We did that with
SETI@home I think about six months ago, and I generated usage data for about
100,000 computers and did some analysis of that. If you're interested in that I'd
be happy to supply any of that data.
As far as the fraction of malicious or malfunctioning computers, we could
probably reconstruct that, but we haven't done that so far.
>> Question: I mean, a rough approximation of how much cheating is going on,
order of magnitude?
>> Dr. David Anderson: Order of magnitude is probably like two or three
people out of a million. It's really small, but if the others learn that people are
successfully cheating they get demoralized and it becomes known quickly. We
added a lot of functionality in BOINC to detect and defeat cheating early on. The
pre-BOINC version of SETI@home did not have any of those checks and there
was rampant cheating, which we had to deal with after the fact.
>> Question: So with the Bossa project, have you thought about, or
are you just (inaudible) -- it sounds like it's exclusively for a certain website. You go
to some website and you can evaluate the photograph or whatever. Have you
thought about what it would mean to host chat in a separate computation making
that framework open so that if you got a cell or whatever that someone has
already got a bunch of A to N, you could have them use their big human brain
and submit results over just the framework, like OSUI looks.
>> Dr. David Anderson: I haven't thought of that, but should be pretty easy to
do. If you have any ideas for what -- for a good potential project, let's figure out
how to do that.
>> Question: (Inaudible) possibly use it all having this (inaudible).
>> Dr. David Anderson: Bossa hasn't actually been used for
anything so far. It only started working about a month ago. The two pilot
projects, just so you know, are both image analysis, and they're kind of interesting.
One of them is looking for hominid fossils. So when people like Louis Leakey and
so forth go out and try to find fossils in Africa, they do it by finding areas where
erosion has exposed fossil-bearing earth, and then they just kind of walk around
looking down at the ground really closely, and it's strewn with rocks and pebbles
and every now and then you -- there will be a tooth mixed in there and you can't
see it unless you like stare right at it and these people become extremely good at
recognizing bone fragments. But it's very hard to cover a lot of area that way and
when you do cover it you trample stuff and you potentially break things.
So this guy named Tim White at Berkeley, who's the new Leakey -- we're working on
a project where we'll get photographs of large areas of these fossil-bearing
regions. The original plan was to have an unmanned airplane sort of crisscross
the thing. I think in the end we're going to have a human carried sort of raft that
has a bunch of cameras on a kind of a frame and a GPS device so you walk
along and it automatically clicks. And then we'll have these people look at these
pictures on the web and try to find fossils. That's one.
The other project involves annotating satellite images of Africa. There's this giant
amount of satellite picture data, but there's no information about, like, where
the roads are, where the settlements are, the crops, things like this. So we're setting up
a project where people can annotate those images and then scientists can learn
something from them.
It would be nice to think of a project that didn't involve images. Something that
involved, you know, some sort of AI knowledge kind of thing.
Anything else? Thank you.
(applause)