>> Savas Parastatidis: So hi everyone. I'm Savas Parastatidis from External Research, and it's my great
pleasure to present Paul Watson from Newcastle University, as it's now called, from the UK.
I've known Paul, as we said yesterday, for 13 years now, since '95 or '96. Paul was
my Ph.D. supervisor, my mentor, and then my boss at the UK e-Science Centre in
Newcastle. And he's been a fantastic mentor. I've learned so much next to him, and
it's always a pleasure seeing him and talking with him.
So I'm not going to tell stories. I just want to let Paul start. He's going to talk to us about the
wonderful work they are doing around cloud services for e-science.
>> Paul Watson: So it's a pleasure to be here again. I'm going to talk about a project called the
CARMEN project, which is one of the big UK e-science projects. As you probably know, Tony kicked off
the UK e-Science programme and ran it very successfully for many years.
Every now and again they fund some large projects to see if e-science techniques can be applied in
particular areas.
This is one of them. So the talk is split into two strands, really. Half is going to be about this particular
neuroinformatics application. I should say that I'm not a neuroscientist, in case somebody starts asking me
difficult questions. I'm a computer scientist. I pick up what I can from the neuroscientists in the project.
The particular approach we take in this area is the cloud computing approach. When we started we
didn't know it was called cloud computing, but that seems to be what everybody calls it these days. We'll
talk about how we're dealing with the problems of neuroinformatics by applying a cloud computing paradigm.
So the reason we got interested in this was because, although I don't know much about neuroscience, I've
always been fascinated by the way in which the brain works.
So I've been chatting to neuroscientists for many years. And all of us in the project believe that this is
the last great informatics challenge. Physicists will always be keen on finding out something about smaller
particles than the ones we understand at the moment. But in terms of human-scale endeavor,
understanding more about how the brain works is, I think, really important.
When I started on this, I went and bought a book on neuroscience. You can go to any bookshop around
the university and buy these textbooks, and they're about so thick. You sit down and you
start to read them. And after about 100 pages your own brain is starting to hurt very much. And you're
starting to flick through faster and faster to get to the chapter called how the brain works. You never find it.
One of the neuroscientists in the project told me that's why we need projects like this one because we want
to be able to write that chapter. So one of the big hopes of this project is that by pursuing this sort of work
somebody can write that chapter on how the brain works.
So our hope is we'll learn something about computer science, because obviously there's many things that
the brain can do that computers aren't actually particularly good at. Like image recognition, for example.
Learn something about biology, and, of course, medicine. So a lot of the drugs that are used operate on
the brain.
So hopefully if we understand more about the brain we can make advances in all of these areas.
And here's a -- I think I have to click to kick this off. So this is to show you the sort of data that people are
collecting and operating on. These are some ganglion cells; they sit between the receptors and the optic
nerve. They're mediating information that's coming in through the retina.
And what you're seeing there, these colored dots, they're active neurons. You can see waves of
activity sweeping through this part of the brain. And this is from the retina of an unborn turtle. So there's
no actual light getting to the retina. What they think is happening is that a lot of
the brain is very plastic,
and that this is part of the process of wiring up the connections in the brain, so it's
ready for when the turtle is born so it can interpret the world around it.
But they're not exactly sure that this is the case. So they collect data like this and they try to understand
what it really means. And there's a surprisingly large number of neuroscientists around the world collecting
this sort of data: about 100,000, which is much higher than I would have guessed.
And they collect information, you can see, at all sorts of different levels, right down from the genomic
up to the behavioral.
And this creates a lot of problems. So at the moment, although there's lots of data collected by all these
different scientists, unfortunately it's very rarely shared. As you can imagine, a lot of this data is very
expensive to collect.
I'll show you some data in a few slides' time that's collected from human patients, and the
operation that allows you to collect this data takes place in the hospital in Newcastle only about once a month.
So they collect this data, but at the moment it's not really particularly well reused. And the reason for that is it
tends to be kept in a proprietary format. Every manufacturer of the kit that you use to collect this data has their
own format for the data.
So labs stick with a particular manufacturer and write all sorts of analysis routines that only work with that
manufacturer's data format. So it's actually very difficult, then, for data to be transferred from one group to
another. And people who write analysis routines can't write generic ones that you can use on
all sorts of data; they tend to be specialized to the kit they have in their lab.
And as a result of this you get much less collaboration than you might expect. Neuroscientists are
aware of the fact that they have different skills and different specializations; some are better at doing the
experiments than the analysis. So if they could collaborate and share the data and share analysis routines they
could make quite a lot of progress, but they're not able to do that at the moment.
I just put this in because neuroscientists are very clever people, in my experience. It's not that they
don't really know what they're doing. This seems to me to be a general problem in science. As you
move around, working on e-science projects in lots of different areas, you keep seeing this again and again.
And I came across Geoffrey Bowker once, who talked about his standard scientific model. He says
science takes place in three stages. First of all, people collect data, and then they publish papers on the
data. And then they gradually lose their original data.
And this wasn't just a (inaudible); if you go to his paper, he analyzes the way in which data is collected and used
in science.
And this creates all sorts of problems. One of the great things about science is
supposed to be the ability to replicate experiments. But a lot of the
data we use now is created by programs rather than being just the original source data.
If you haven't got this data, then how can you replicate these experiments? How can you reuse the data?
How can you take other people's data and see whether your analysis routines, when you apply them to their
data, give you the same results as on your own data?
So this is really holding back not just neuroscience but lots of sciences.
I would claim you get the same for codes in science. Codes go through the same stages: people write
codes to analyze the data, they publish papers on it, then gradually lose the original codes, and it has the
same problems. So although there have been various attempts in a number of sciences to allow people to share data,
what we decided to do in the CARMEN project was also try to get them to share the analysis routines as
well. So we're trying to attack both problems.
And so, CARMEN to the rescue. This is the mission statement of CARMEN: to allow scientists, wherever they
are, to collaborate and share their analysis code, their data and their expertise.
So this was what we put forward to the committee, including Darren, in fact, who thankfully decided to
fund us. It's quite a big project. The computer science is done by Newcastle and Jim Austin's group at
York, and there are neuroscientists around the country who are collecting data and want to use this system to
share and analyze that data.
And there's a few companies, including Microsoft, who have signed up to support the project as well.
Within the project, the focus is on neural activity. As I said, there's lots of different sorts of data that
people are interested in. For neural activity, they put one or more electrodes into the brain, and
then you get readings like this.
And because you can't put an electrode into a specific neuron, the information you're getting is from a
whole set of neurons around where the electrode is. So the first thing they have to do is what they call
spike sorting, where they try to work out the set of neurons which have contributed to this signal that you're
getting out from the electrode; there are various techniques to do that.
Then they have some way of working out where the spikes are, because although people aren't really sure
how the brain processes information, there's a lot of evidence that these spikes are the way in which
information is passed through the different regions of the brain and processed by the neurons.
And the idea is basically to try and understand what on earth these spikes mean. And that's a very difficult
problem. The reason why there isn't a chapter at the moment on how the brain works is because
basically you're sampling. Because you can stick one electrode, or nowadays an array of electrodes,
into the brain, you're sampling from a set of neurons in one particular part of the brain.
You don't know what the connectivity of those neurons is. So you don't know what's connected to what.
You don't know whether you're getting information from all the neurons. You don't know whether the
information you're getting is specific to the particular place where you've inserted the electrodes or whether
it's more general.
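The spike-detection step described above can be sketched in a few lines. This is only an illustrative threshold-crossing detector, not the project's actual algorithm, and the function name and parameters are hypothetical; real spike sorting also classifies the detected waveforms by shape to attribute them to individual neurons.

```python
import statistics

def detect_spikes(signal, fs, threshold_sd=4.0, refractory_ms=1.0):
    """Flag samples whose amplitude exceeds threshold_sd standard
    deviations, skipping a short refractory window so each spike
    is counted once. Returns the sample indices of detected spikes."""
    threshold = threshold_sd * statistics.pstdev(signal)
    refractory = int(fs * refractory_ms / 1000)  # window length in samples
    spikes, last = [], -refractory
    for i, v in enumerate(signal):
        if abs(v) > threshold and i - last >= refractory:
            spikes.append(i)
            last = i
    return spikes
```

A real pipeline would follow this with the "sorting" part: clustering the spike waveforms to work out which neurons contributed to the signal.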
So it's very difficult. I sometimes draw the analogy: imagine we were trying to understand how
integrated circuits work, and all we had was a probe that we could stick at various points into an IC,
and on the back of that we had to work out how a Pentium processor worked. It's a problem of that sort
of magnitude.
So here's an example, just to show you how data might be expensive to collect, and why we really want
to share it. This is from some work that's done at Newcastle in the hospital there. Every now and
again, roughly 12 times a year, a patient presents at the hospital. They've got terrible
problems with epilepsy that can't be controlled by drugs. So as a last resort they remove a part of the brain
in order to try and cure the problem, or at least to reduce it.
And in order to do that, they need to work out which part of the brain to remove. Obviously it's very
important that you remove the right part, no more and no less. So what they do is they open up the brain
and they use electrodes to gather information, and they move the electrodes around until they've
located the part that they think causes the problem, and then they remove it.
And then they analyze the slice they've removed while it's still alive; they can oxygenate it and keep it going with
drugs for somewhere between minutes and hours and gather some information. So they can do this once a
month, and they collect the data, and then largely the group that collects it is the one that uses it. But, as I've said,
the problem is it's not particularly shared. It's not as if it's easy for another group somewhere else to
take that data and apply their own analysis routines to it, or to combine data from other hospitals.
So this is a warning. I know some people don't like seeing medical photographs, and the next two slides
are going to show an exposed human brain. So look away now if you don't want to see it. When I started
going to neuroscience conferences, they didn't give you a warning. Sometimes you can jump out of your seat.
Here we go. So here's the part of the brain where they think the problem is. They've removed that part of
the skull to prepare it. And you can see there these two sets of four electrodes being placed on the brain,
and the surgeon will move those around and look at the signals coming out. And when they think they've
located the area, they'll remove that part of the brain.
We're past it now, so anybody not looking can look again. There's no more slides like that. And this is the
information that they're getting out of the brain. For the part that causes the problem with
epilepsy, what you usually get is a part where there's not really very much activity, but in the
surrounding areas there's lots of activity.
So it's almost as if a part of the brain which is causing the problem blocks the flow of signals around that
part of the brain. So you see them all building up around that area and a very dead area which is the part
that they remove.
Okay. Here's another one. So when it came to trying to work out what we needed to do to address this,
the key thing here is to be able to share. So sharing data and code: as I've said, we're trying to share
the analysis routines as well as just the data that's collected from the experiments.
And the capacities can be quite large. I've said vast; I guess to a company like Microsoft 100
terabytes is not very big, but it is compared to other e-science projects. We expect to collect that much from the
scientists within the project over the next couple of years.
One of the problems is that as these techniques come along, you get more and more data collected. So
when we started the project, when we started writing the proposal, single-electrode recording was quite common.
And then multi-electrode recording came in. So it moved from one electrode up to about 60 to 500
electrodes that they stream from all the time. Then the video techniques I showed you give you a
factor of 10 more data.
So it went from tens of megabytes per experiment up to gigabytes, or even, in some of the bigger
experiments, terabytes in a single experiment, in the course of two or three years. And this is only going to
increase as these video techniques start to take over.
And once you've got that sort of data, of course, you have to analyze it. And this can take a very long time.
I had someone in my office last week who estimated that their analysis routine would take half a year to run
on some of these data sets that we'd collected. So you need to share very large amounts of data, with
these analysis routines running on it.
We came up with this cloud architecture for it. The idea is, if you've got a cloud somewhere out there
over the Internet where people can upload their data, then it's easy for them to share it, because they can
just change permissions on it and other people can see it. It doesn't have to be physically transferred.
And once they've uploaded their data they can also upload services. So this was the idea. It would
be nice to have a fixed set of services that would support the needs of the neuroscientists. We would
provide them out there in the cloud; the scientists would upload the data, choose which services to run on the
data, and visualize the results.
In practice, of course, being scientists, what happens is they're driving into work in the morning and they
come up with a new algorithm, and they want to code it up and apply it to all the data in the system.
So we needed a way for users not just to upload data but to upload code to
analyze their data as well.
And so once you've got your data up there and the services are up there, then users can run analyses. So
they run the analysis out there somewhere, just using a web browser, and they can visualize their results.
That's quite a common cloud sort of model these days. Perhaps the thing that is a bit different about it is that
they need to be able to upload and download services to run.
>>: What exactly are they uploading? When they upload this analysis, is it a VM?
>> Paul Watson: I'll talk a bit more about that later. But we get people to write everything they do as
web services, so we have some sort of standard for it. We can cope with having web services in a
VM, but we can cope with it in other ways as well: WAR files, .NET assemblies. As long as we have
some sort of deployer for it, we can handle it. The only restriction we have is that it's got to be wrapped as a
web service, so we've got some common way to deal with it, to put security around it, and to compose it
with other services.
And there's quite a lot of interest, as you know, in cloud computing. So one approach to this is to
say we'll provide some low-level storage and compute services. This is the Amazon approach,
where you can buy storage on S3 and you can run things in EC2. And then we could say, okay, you
just write your science apps above that. We didn't want to do that. What we wanted to do was to
think about whether there are any services that we can factor out, that we could provide once, and then people
could just use them.
So people could move up the value chain and just write specific services at the top. Based on the last five or
six years of running e-science projects, we sat down and thought about what these common
services we'd want to provide were.
And so we rejected that option and basically ended up with this. So these are the services that we came
up with in the cloud. So this is basically our cloud here. And there's a set of services in there. We run it on
our own cluster at the moment. I'll talk later on about why we don't run it on Amazon or something. But we
have our own cluster to run it on.
So if I go through the services, so you won't be surprised that we need somewhere to store data. So the
real data goes into files, because it's basic time series data. And then we also have a database. In fact,
SQL Server that we can store secondary data that we produce from those files.
And then we have to have some way of describing these files, because the scientists have to be able to
look for data that was collected in particular sorts of experiments by particular groups.
So we've got a metadata store in there. When scientists upload their data from their web browsers at the
far end over there, they fill in forms to describe the metadata. We've got some ontologies that
we've devised: we use some general ontologies from biology to describe the
experimental process, and then some specific ontologies that we've had to produce for electrophysiological
data, because none existed before the project started.
So we end up with your data and your metadata here. And then -- I'll talk about this in more detail --
we've also got the repository for services that people have uploaded. And then we've got a cluster.
And when people want to run these services on the data, we can take the services in deployable form
from this repository and deploy them on one or more nodes and run them in the cluster.
We've got a workflow enactment engine, because workflows have been successful in different e-science
projects. They're not used by neuroscientists yet, but our scientists say it's a valuable way for people to compose
together different analysis codes, so people don't have to write a siloed routine which does everything.
Someone who's good at visualization can write visualization routines and compose them with somebody who has a
statistical analysis, and compose that with somebody else who has spike routines. It allows people to combine
these processes together and share them.
And we wanted to run that in the cloud as well, because obviously the danger is, if you've got your
workflow enactment engine out on the user's desktop, as is commonly the case, then all the data would have to
fly backwards and forwards, and it would take a long time if you were trying to analyze terabytes of data.
>>: Which workflow engine?
>> Paul Watson: At the moment we're using Taverna. That's largely because historically this is the
one we've used and got the expertise with. But we're not completely tied to it. We could move to anything
else, as long as it would run in the cloud remotely.
>>: I reckon you could --
>> Paul Watson: Yes, we'd have to apply for permission, but I guess we could get that.
>>: (Inaudible) [laughter].
>> Paul Watson: Screen saver, I think.
So then we've got a registry, which allows people to search for the data and the analysis services that they're
interested in. And we've got a security wrapper for the system. Neuroscientists, like lots of scientists,
aren't the sort of people who are just happy with the idea of sharing all their data as soon as it comes out of
the experiment. They really want to get those Nature and Science papers written and published before
they allow anybody else to see it.
We've done lots of requirements analysis with all these different groups of neuroscientists that we've got
involved in the project. And it's quite interesting: when you ask, once the data is
uploaded, who would you like to be able to see it, the initial reaction is always "me". And then they think,
and maybe my students, and perhaps later my collaborators, and so on.
But really you need a quite a fine grained security system running there before anybody would consider
uploading the data into the cloud.
So we've got a system where once you've uploaded the data you can decide who is allowed to see it and
change it over time as you allow more and more people to have access to it.
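That widening-over-time model ("me, then my students, then my collaborators") amounts to a per-dataset access-control list whose reader set only the owner can extend. A minimal sketch, with hypothetical class and method names rather than the system's real API:

```python
class DatasetACL:
    """Per-dataset permissions: the owner starts as the only reader
    and can grant access to more users over time."""

    def __init__(self, owner):
        self.owner = owner
        self.readers = {owner}      # the initial reaction: "me"

    def grant(self, granter, user):
        # Only the owner may widen access to the dataset.
        if granter != self.owner:
            raise PermissionError("only the owner can grant access")
        self.readers.add(user)

    def can_read(self, user):
        return user in self.readers
```

In a real system the grants would of course be persistent and enforced by the service wrapper, not by the client.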
And then we've got the people accessing this over the web on the left-hand side. Roger?
>>: I'm curious why you did not put a provenance service on there, so people could capture what they did.
>> Paul Watson: So that's a very good point. We have actually got somebody working on a provenance
service, but we didn't commit to it at the beginning of the project because we'd never written one before,
and we couldn't find one that was just available to drop into the system. But you're right, that would be a very
valuable addition to these services.
>>: If somebody creates a provenance service for the repository, could that be used, or is it something that would
have to be built in?
>> Paul Watson: I think it would have to be built in, because it has to have probes into all the other things
going on. Every time you run a workflow you want it to record who ran it, what the workflow was,
which services were called, and so on. So I think it would have to be one of the first-class services,
rather than one of these analysis services here, which are then choreographed using the workflows; the
provenance service would record what is actually happening.
Okay. So I'll say a little bit about one of these things, which actually picks up on the question I had earlier.
This is a system that we've had for a few years now, which allows us to have this code repository. It's a
thing called Dynasoar, which actually Savas worked on in the early stages. The basic idea is that we
wrap code as web services. We stick to WS-I web services because we want them to be
interoperable, and we don't want to have to go back to everybody in a year or two's time and change to
whatever the new web service standard is. WS-I seems to be conservative and does what we want, so we
stick to that.
That means the internals aren't important. People can write services in Java. A lot of neuroscientists write
in MATLAB, and this proved to be quite a pain for us because of the licensing model. So we had one
person technically looking at how we might run MATLAB in the cloud, and another person reading the small
print of all the MATLAB documentation to make sure we wouldn't be imprisoned or fined for doing this. But
we think we've cracked both of those.
There's a MATLAB compiler; once you've produced its output, you can run that without a
license. That's what we do, so we can support MATLAB services as well as any others.
The key thing is, whatever you've got, you have a deployer for it. So once you've got this service in
your repository, you just need somewhere to deploy it. We've done virtual machines; I think Savas did
that one many years ago, the Virtual PC one. The .NET assemblies we can also deploy. And Jim
Gray once suggested to us we should be able to deploy services in a database. So we've done that as
well: we can deploy services right into SQL Server, if that makes sense, saving us
from transferring data in and out of the database.
So this is the way it works in a bit more detail. All services have end points, just ordinary
service end points; clients don't see any difference from a normal service.
What happens the first time you call a service is that the client sends the SOAP message to this end point,
which is owned by a web service provider. And it then sends it on somewhere; it can be anywhere where we
have these things called host providers, which are in charge of the compute resources.
And it's not necessarily the case that the service has to be deployed at the time when the request arrives at the host
provider. Because the web service provider, when it gets the request, inserts into that request the handle
of a place where a host provider can go to get a deployable version of the service and deploy it if necessary.
So that's what's going to happen in this case.
So this is for service S4, and S4 is not deployed on any of these nodes, so a request goes out to the service
repository. A deployable form of the service comes back and is deployed on the node, and the request can go
to that node and run, and then the response will return to the client.
So basically it's a way for us to build up this repository of services.
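The deploy-on-first-call flow just described can be sketched as follows. This is a toy model of the idea, not the project's actual code: the dictionary stands in for the service repository, and deployment is simulated by caching a callable.

```python
class HostProvider:
    """Manages compute nodes. If a requested service is not yet
    deployed, fetch its deployable form from the repository and
    deploy it; this and all subsequent requests are then served
    by the deployed copy, amortizing the deployment cost."""

    def __init__(self, repository):
        self.repository = repository  # service name -> deployable form
        self.deployed = {}            # service name -> running instance
        self.deployments = 0          # how many times we paid to deploy

    def invoke(self, service, request):
        if service not in self.deployed:        # first call only
            self.deployed[service] = self.repository[service]
            self.deployments += 1
        return self.deployed[service](request)
```

The first call to a service pays the deployment cost; later calls go straight to the running instance, which is the behavior the cost graphs later in the talk illustrate.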
>>: What is the node?
>> Paul Watson: So that would just be a processor. So in our system at the moment this is a cluster. So
these are each of the nodes in the cluster.
>>: What platforms are you offering? Like, you said you have .NET and Java and database. Are you offering
Linux and Windows and MySQL and SQL Server and all these kinds of things?
>> Paul Watson: The cluster at the moment is a Linux cluster. But the infrastructure which supports
Dynasoar could be ported to anything. Savas, a few years ago, got a part of it working on .NET over
a weekend. So it's always over a weekend.
>>: You're running Mono with --
>> Paul Watson: No, that one was an implementation.
>>: But these are not -- this could be written so they support more --
>> Paul Watson: No, not at the moment. We've not tried it with Mono. I guess we could.
>>: This is a Java machine?
>> Paul Watson: Yeah.
>>: I thought you said you were supporting .NET and (inaudible).
>> Paul Watson: We have done in the past. In the current system, though, we couldn't run those at the
moment, except that you could put them into a virtual machine and then we could deploy those and run them.
So that would work. But, yeah, at the moment the host provider is the component that you would
have to port to another sort of machine, and at the moment it's on a Linux platform in the CARMEN
project.
Once you've deployed the service, then subsequent calls just get routed to a deployed
version of the service. So there's no overhead every time you call the service; it just gets sent to a node
which is running the service.
So one of the attractions of this: with things like WAR files or
.NET assemblies, you might end up with 10 or 20 megabytes representing your service. As soon as you go to
virtual machines, that goes up; it can be five, six, seven gigabytes. It takes quite a long time to move these
onto a node and deploy them. But the attraction of what we're doing is that once it's deployed you don't have to
move it every time.
It can just stay there, and the cost of deploying it gets amortized across all of the different invocations that
you make on that service once it's there.
>>: Does this have any understanding of cost, in the sense of the time to deploy an instance of something, in
the event it does have to get removed from the host, or --
>> Paul Watson: Yeah, we've done some work on that. I'll come back to that in one slide's
time, Darren. I thought it was this slide, but it's actually the next one. So this just gives you an idea of some
costs.
This was where we were investigating what it would be like if we didn't deploy the services close to the
data. This was the case where CARMEN was just a repository, and people were running the routines on
their local systems.
These are some tuples; each of these tuples was a few hundred bytes coming from this database, and then there
was an analysis service. This purple line shows what happens if the analysis service was remote from the
database. And you can see that for this particular analysis routine you pay a high price for transferring the
data. This is logarithmic, so as the number of tuples requested goes up, it's really starting to cost an
awful lot to move the data around.
With Dynasoar, you now move the analysis routine close to the database, and you can see that the cost drops
considerably. This yellow line is the cost the first time you call the service using Dynasoar, so this
includes the cost of deployment.
Then this is the cost once it's deployed, so this would be the cost for the second and subsequent
calls of the service, down there.
So you can see you do pay a cost in the deployment. But once you've done that, the overheads, as it
turns out, are quite low for having Dynasoar in the system.
Right. So here's the answer to Darren's question. We played around with various algorithms; this is
the most sophisticated one we've tried. This is the one which tries to keep the response time constant
for a service as you increase the number of requests.
It does that by monitoring the time that it's taking to process each request, and then deploying multiple
instances of the service in order to give you more concurrent services which are able to process
requests of this type.
So if you start over here: this is quite a low arrival rate, and we've got two processors here which are
able to process the requests, and the response time's down here. As the arrival rate
increases, the response time keeps steady, because the system keeps deploying more and more services
on different nodes. You can see the number of services in those square boxes; it goes up to two, four, six, and
so on. It works until you get to about 12 processors; after that, as the arrival rate increases,
we run out of nodes. We've only got 16 nodes, and the system can't cope with this arrival rate beyond that.
So this is another attraction, we think, of having the services in deployable form in a repository: you
can deploy multiple instances of them in order to scale up the performance if a service becomes popular.
And, of course, you can then get rid of services if the arrival rate drops. Some
people in the department who are interested in queueing theory and mathematical modeling have done some analysis
of when you should get rid of services as well, because they might affect a more popular service.
You've got a balance: you've got a fixed set of nodes and a set of services that people want to use, and you
are trying to decide how to partition up the nodes across them.
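The scaling policy just described, adding instances while the response time drifts above a target and reclaiming them when there is headroom, within a fixed pool of nodes, can be sketched as a simple control rule. The function name and thresholds here are hypothetical, not those of the actual system:

```python
def rescale(response_times, target, instances, max_nodes):
    """Return the new instance count for one control step:
    deploy another instance when the mean response time drifts
    above target, retire one when there is clear headroom."""
    mean = sum(response_times) / len(response_times)
    if mean > target and instances < max_nodes:
        return instances + 1          # scale up, bounded by cluster size
    if mean < 0.5 * target and instances > 1:
        return instances - 1          # free a node for other services
    return instances
```

Run periodically per service, this reproduces the plateau seen in the graph: the response time holds steady until the instance count hits the size of the cluster, after which the rule can no longer scale up.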
>>: What happens past that point -- do the response times still increase?
>> Paul Watson: Well, once you get to here, then you just haven't got the power when you've deployed the
service on the 16 nodes to deal with all the requests as they come in. So at this point then you really have
to buy a bigger cluster in order to keep the response time at this level for that sort of arrival rate.
>>: How are you load balancing (inaudible) for example just issuing new end points to this request coming
in?
>> Paul Watson: This is just round robin load balancing for this experiment. So however many
deployments there are, you just round robin the requests as they come in across the cluster. It may be
that we could do something more sophisticated with that. In the 1980s I worked with a group that did a
lot of work on load balancing and scheduling, and people would write Ph.D.s on it, and study this for
three or four years, and then discover that random was actually about as good an approach as round
robin. I've never really tried to do anything more sophisticated in that area since.
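For what it's worth, the round-robin dispatch described here is only a few lines; the node names below are made up for the example.

```python
import itertools

def round_robin(deployments):
    """Cycle endlessly over however many deployments of a service exist."""
    return itertools.cycle(deployments)

# Dispatch five incoming requests across three hypothetical deployments.
dispatch = round_robin(["node-1", "node-2", "node-3"])
targets = [next(dispatch) for _ in range(5)]
# targets is ["node-1", "node-2", "node-3", "node-1", "node-2"]
```

A random policy would just replace `next(dispatch)` with `random.choice(deployments)`, which, as noted above, tends to perform comparably in practice.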
Okay. So that's one bit of technology. I'll just go to the end. We'll go through a typical scenario. So this
is the sort of thing that, when we looked at the requirements from the neuroscientists, this is the sort of
thing they wanted to do. So to start off, they collect from a multi-electrode array. So this is a
multi-electrode array. You saw the brain earlier when the slice was taken off; it's placed on top of there,
and each of these is an individual electrode, and you can read out from them by connecting to these
parts around here.
The neuroscientists would want to have a look at the data. They would want to make sure they were
getting some signals from it, and they might move the slice around if they weren't getting a very good
response from it, or they might try different drugs or whatever, until they were getting something which
they thought was worthy of analysis. Then they'd do the spike detection and sorting that I talked about
earlier, working out how the signal they got was spread around the set of neurons.
They'd do some sort of statistical analysis to try to work out what the spikes that they'd identified meant.
And then they'd visualize the results. And this was a semi-manual process. We found one group where,
even for a small amount of data, it took them about a day or two to analyze it, because of doing data
format conversion between the data as it came out of the electrode array kit and the analysis routines;
they were putting everything through an Excel spreadsheet in small chunks manually in order to do the
data conversion. So that was one thing that we were able to fix.
So if I go through these stages very quickly. So I said that this is about cloud computing, and the idea is
to move everything into the cloud and everything uses a browser. This is the exception that proves the
rule. So this is a tool that Jim Austin's group at York built for the DAME project, which was one of the
projects that Tony funded, looking at aircraft engines.
So this feature here -- so in the classic example of this for aircraft engines, this would be a bird strike.
They'd have a particular pattern they identified for a bird strike, and they'd want to find all the
occurrences of it in all the data from the engines that they collected.
This is actually from one of the electrode arrays. You can see, these are each of the electrodes in the
array, and these are the spikes here. The neuroscientists would have a look at this. They'd zoom along
and have a look through the data at various times and just make sure they were getting signals and they
were happy with it before they decided to analyze it.
So this is one tool -- so this is actually a Windows application that runs on the desktop. The people at
York say it couldn't be put into the cloud because of the nature of the interaction with the user, and
trying to get the data backwards and forwards between the cloud and the tool.
And this is what you can also do with this tool. So with the Signal Data Explorer, which is the tool that I
showed at the top left, you can kick off searches which are sent close to the data. The data are stored
in the Storage Resource Broker (SRB), which is a file management system, and you can search for
particular patterns, and it moves the patterns as close as possible to the data to reduce the amount of
information that's being sent in and out of the system.
The idea is that only a small amount of data should be sent between the cloud and the desktop tool.
And so what happens is you kick off these searches, and at any time the client can request to have a
look at some of the positive results that have been sent back, for the user to come in and browse
through. Typically, in this case, they'll be looking for particular sorts of spike activity.
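The "send the search to the data" idea can be illustrated with a toy sketch: each storage node filters its own records locally, and only the small set of matches crosses the network. The function names and the threshold-style predicate are invented for illustration; the real system uses York's pattern-matching technology running against SRB.

```python
def search_node(local_records, matches):
    """Run the pattern search where the data lives; ship back only the hits."""
    return [r for r in local_records if matches(r)]

def federated_search(nodes, matches):
    """Fan the query out to every storage node and gather the small result sets.

    In the real system each entry in `nodes` would be a remote machine with a
    local SRB; here it's just an in-memory list of samples.
    """
    hits = []
    for records in nodes:
        hits.extend(search_node(records, matches))
    return hits

# Toy query: find voltage samples that dip below a spike threshold.
nodes = [[0.1, -0.9, 0.2], [0.0, 0.3], [-1.2, 0.4]]
spikes = federated_search(nodes, lambda v: v < -0.5)
# spikes is [-0.9, -1.2]
```

The point of the structure is that `search_node` runs next to the data, so only `hits` -- not the raw recordings -- ever travels back to the desktop tool.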
>>: This is multiple deployments of SRB?
>> Paul Watson: You'll have to talk to Savas about the exact way it works. So basically on each node
you deploy a local SRB, which you can then send the requests to.
>>: My team has never seen anybody who has really used SRB. They disbelieve me.
>>: It's different SRB.
>>: Oh.
>>: There's SRB here too.
>> Paul Watson: Yeah. And so the people at York, Jim's an expert on different ways of searching.
They've got different ways of searching that are efficient and scale well across a set of nodes.
So we defined the process that the users were doing with a workflow. It takes in the data for this
analysis. So this is written in R rather than Matlab. R tends to be used by people who don't want to use
Matlab because it's proprietary. But not everybody can do what they want in R, because it turns out that
Matlab comes with lots of libraries, things like signal processing, which are actually very useful for what
the neuroscientists want to do, and you can't do everything with R. So in this case the R code is actually
doing some statistical analysis once you've identified where the spikes are.
And it produces some output you can see at the bottom. So this is an early version of the portal --
we're not user interface design experts -- but it gives you an idea of what you can do. You can see the
data that you have access to in the left-hand panel over here, and you can select some data; so in this
case, the result of this experiment. You can have a look at what workflows are available, and you can
click and select to run a workflow. It runs in the cloud and then you get the results back.
So this is one of the graphs that's produced. This is actually showing -- so the density of the black line
shows the rate of spiking activity at a particular electrode. You can see this is one of those waves
which is flowing across from one part of the electrode array to another part.
And then they also like visualizations. Some of these are quite crude but effective. This one, which
started on its own account, I can't remember -- here it goes. For each electrode, the radius of the circle
shows the rate at which the neurons at that electrode are firing.
You can see the activity, and you can see the flow of the activity across the slice quite well like that. It's
much easier than if you've just got graphs or printouts of numbers.
Okay. So just to finish quickly. So CARMEN is aiming to deliver new results in neuroscience and
hopefully computer science and, if we can, to apply some of the results of the neuroscience in medicine
as well. And we're delivering it through an e-science cloud, which lets the focus go beyond
neuroscience: what are the common building blocks you'd like to have in the cloud? Our hope is we'll be
able to use them for other sorts of science as well.
So one of the -- a new project which we have, which is funded -- so One Northeast is the Regional
Development Authority for the northeast of England. They've funded a pilot study looking at how you
can apply this approach to other sciences, so we have money to go in and look at what scientists do,
take their codes, wrap them up as services, put them in the system and provide this sort of system to
users.
I think I've got another -- so this is the vision here. So you've got your basic set of e-science services at
the bottom. You've got some core scientific services that are independent of any particular science, and
then particular domains -- so I've picked these four domains over here because these are four that the
Regional Development Authority has said are going to be important for the northeast of England, and
they're pumping money into them. They're also interested in building up sets of data and services that
can be used for skills and education, both in universities and also in schools in the area.
And there's no reason why you have to stick with just science services. So we've been looking at some
business services as well. So it's quite interesting, if you look at -- so the northeast isn't a high-tech
hotbed in the UK. There are odd pockets of expertise.
But the companies are often small, and they struggle to get going. One of the things One Northeast has
been trying to do is to make it easy for them to set up and run companies. And there are quite a few
people who will host business applications like these ones here -- project management and CRM and
so on -- for a monthly cost for these small companies, and the Regional Development Authority is keen
to subsidize and encourage people to do that, to get off the ground without big investment.
So the idea is to make these available as well through the same sort of platform. So one of the users in
a start-up or in a university would log onto the system and be presented with a set of services, narrowed
down to their particular scientific or business area, where they could run them. Hopefully they can do a
lot of work within this sort of environment without having to invest in people or software.
Because it seems like the up-front cost of creating a start-up in a science like bioscience can be quite
large, and hopefully moving more towards this pay-as-you-go type of approach could reduce those costs.
I think I've got one more. So this was the hope -- that's the hope from the Regional Development
Authority. They think they could sell it to businesses that could come into the area and get all the
software they wanted for a low cost.
They're also interested because we're adding social networking features into it. They're interested
because, when you've got small companies, how do you build up these relationships, how do you build a
community? And again, you can do that with this.
And the very last slide to move on to. For a company such as this one, that's looking at commercial
clouds and so on, this could address a real problem that there is in science, which is sustainability. It's
a real worry for us and our users that we've got four years (inaudible), and they've put a lot of effort into
writing the applications and uploading the data, and then what happens after the four years? We've
really got no guarantees, because we're a university: we're really pitched to do research rather than to
create these systems and maintain them over a long period of time.
So we've been looking at commercial clouds -- for example, the one from Amazon at the moment --
and whether we could use that. And the attraction is that it might save us from having to provide our
own storage and compute, so we wouldn't have to run our own cluster.
And the idea would be to find some way of moving from our own private cluster over to using Amazon's
facilities, perhaps in a controlled way. And there's a local company, who (inaudible) actually worked for
as well, that we've been doing some work with, and they're looking at this area of how you move from
your private internal cluster over to external resources like Amazon. So that might be a way, over time,
to migrate off our own system onto something else.
But it still leaves this problem, because if you look back at that stack that I told you about earlier: we've
got the storage and compute needs, which a commercial cloud like Amazon could provide. But that still
leaves all of the services in the middle here, which we have to build and support and maintain if these
users at the top are to be able to use --
>>: Why can't you factor that as a VM that runs on those commercial sites?
>> Paul Watson: You could do that, but of course, it would still need some maintenance and you'd still
have to upgrade it and so on.
>>: (Inaudible) however upgrading would be -- a version (inaudible) because their science project is still
there.
>> Paul Watson: I think that's a good idea, and I think that's the way that you would organize this. But
software seems to atrophy over a period of time if you just leave it. There are always security patches
that need to be applied, or somebody wants to just add in this extra feature. My feeling is that if we just
effectively froze in virtual machines what was around at the time, then these users would find that over
time it would meet less and less of their needs and they'd have to move to something else.
So my feeling is that you need some people who are responsible for maintaining and supporting it at this
level -- the security services, the workflow services and so on -- and then the scientists can concentrate
on their own services at the top. So that's where we see the gap at the moment.
So that's the question at the moment: how do we provide and support these e-science services? So
we've been looking at whether a company could do this, and for the first time in my career we've started
to get involved in thinking about whether we should look to have a company which would actually try to
provide this, which would look at the full range of things from here to here. But even then, as Roger
points out, the more that commercial companies could do in this area here, the more it would reduce the
amount of effort you would have to put in up here in order to provide services for users over a period of
time.
So that's a company that we're setting up.
>>: How does it relate to design as far as -- design cost and the company delivering optimization services,
cloud services?
>> Paul Watson: I didn't know about Simon's work in that area so we should talk to him. Okay. And that's
basically it.
(Applause)
>> Savas Parastatidis: People have been asking questions throughout. Any more questions?
>>: So even if you have a company that you set up, the problem you have is: does the UK university
system essentially feel that this is something -- this is infrastructure (inaudible), because that
fundamentally seems to be the research (inaudible) across the infrastructure on a project basis. It's
bricks and mortar, which in that case is the university's program. So basically, is there a shift in attitude
coming in the wind as to what research funding is there for?
>> Paul Watson: So I don't think that there is, from what I've seen. I think there's pressure from funders
now that say that people must make publicly available the results of their projects. But it's not at all clear
how you do that. There isn't a group out there which could take your data and your services.
I mean, at the moment a lot of the focus is on data. But really, data on its own is pretty useless. At the
end of the project we could give 100 terabytes to somebody, you could put it on a disk, but it wouldn't be
useful. Nobody could afford to download it to analyze it.
So you really need a combination of the two. We've not seen anything that's there which is looking to do
that.
>>: So these are very profound questions. What about institutions like the National Grid Service and
you can --
>> Paul Watson: So a few years ago I suggested that --
>>: (Inaudible).
>>: Yeah.
>> Paul Watson: So a few years ago I suggested that what the National Grid Service should move into
would be something which -- to take the phrase that Google used to describe themselves, organizing
the world's information -- I said, why doesn't the National Grid Service move to be something which
organizes the UK's science information?
So it would take not just the codes, which the National Grid Service has the capability to do -- the OMII,
for those not familiar with the odd UK terminology, this is a group which Tony set up which is to look --
>>: I was on its board.
>> Paul Watson: Okay. So there's three people who know it very well.
>>: Pick your words carefully.
>> Paul Watson: This fantastic system. So that was set up to allow people to take services, code that
was produced by UK e-science projects, and get it to the point where other people could download it and
install it, and to maintain it and produce new versions and documentation and so on. It did a good job of
that.
So on one side you have the National Grid Service that would run things, and on the other side you
would have OMII that would produce software that you could download and install. But really I felt that
what's needed now is to bring all this together, so that in one place you can have your data, and you can
have the services, with some group which will look after those services so that nobody needs to
download and install them anymore.
That was the attraction of really moving to the cloud model. It seemed that was the way that the
computing world was going. You can imagine in that model, if you could get some people who would
provide these basic services, the scientists could concentrate on their own specific services up at the
top.
But I have not seen anything like that that exists at the moment. That's why we wondered about having
a go ourselves, to see whether -- so the first part of this project with One Northeast will be going around
scientists at the university and some more companies, and seeing whether this sort of cloud model
really works for them or not.
If it does, then hopefully we'll be able to work out how to provide it on a commercial basis -- how much
does it cost to actually -- so the problem with Amazon is that people focus on just the low-level costs of
cycles and storage, and they don't concentrate on the costs of maintaining the software above it.
So I'm hoping this will give us a better idea of how much it would actually cost to provide this sort of
system to the users.
>>: Just one story and then one comment. I went to see the (inaudible) at UC Davis, and they were
complaining about how the OMII had made the Taverna code documented and well engineered, and it
was unfair competition. So we did achieve something.
The other one: did you talk to Alex Savo (phonetic) and his team? They've developed something they
call S4, which is simple storage services for science. So I think there may be some interesting language
there.
>>: My question was answered, so I'll withdraw my question.
>>: You have a project (inaudible) called (inaudible) bioinformatics. How do you build on this concept for that?
>> Paul Watson: Gosh.
>>: Aren't you talking to the Burn (phonetic) folks?
>> Paul Watson: Yes I'll be there on Monday.
>>: Which is a similar thing to CAB.
>>: You highlighted that fact (inaudible). The problem may be in the medical community, because the
culture in the medical community is such that you don't want to share your data. So there's a huge
infrastructure in place but really not that much use -- so that subsequently will be a similar problem.
>> Paul Watson: You think the reason people don't use it is because they're reluctant to share their data
with others?
>>: Well, that's part of it. That's not actually all of it. There's a barrier to entry in regards to actually picking
up the technology. Also there are different instantiations of or levels of participation and hoops you have to
jump through in order to even become members of CAB.
Centers of excellence like Georgia Tech have already jumped through those hoops, but other institutions
haven't been quite so fortunate in doing that. And the other problem is that they do tend to focus on the
minority areas and not where most of the actual oncology research takes place: 80 percent of it takes
place at the regional level, not at the national centers of excellence, which are actually more
academic-type institutions. So there are a lot of reasons for the lack of adoption. I think it's a completely
separate issue from this particular system here. I think this is a great model. In fact, it's probably a
better model than the one that system is based on.
>>: We need to make certain we do not confuse the concept of putting the data in the cloud with
sharing. Just because your data is in the cloud does not mean that you share the data. Your Google
mail is in the cloud and, apart from Google accessing it, you don't share it. But that's the price you pay.
This has been part of the discussion we've had with Burn, and it has to be a discussion.
>>: It would be interesting to know what users actually use. For Burn, for example, they give all these
neuroscientists Globus, Condor and SRB. Do they really need that? Is that what they want? I would be
interested if you have some insights into that, and similarly for CAB. I've come round to the view that,
having mistakenly pursued a middleware approach for five years in the UK e-science program, we need
to do a web service, cloud service type approach.
>>: This year.
>>: What's behind the web services, is that middleware behind the web services?
>>: But you don't require the user to install 800,000 lines of Globus software on their server.
>> Paul Watson: That's the danger of where clouds are at the moment. People see clouds as being
these low-level Amazon services, and in order to do anything you actually do need to build your own
middleware on top of that. So I think it's this hole here that's the problem, and who is going to fill that in
the cloud.
>>: (Inaudible).
>> Paul Watson: Well, that would be very nice. The gentleman at the back.
>>: So you talked a little bit about the part of the data analysis piece. I'm not sure what -- is that just an
example or do you think it's a key part of the stuff you guys are doing?
>> Paul Watson: So we have that because our collaborators at York -- that's their technology and that's
their interest. It's certainly been useful in applications like the aero engine application; that was a
successful project and it was used at Rolls-Royce.
The neuroscientists really like being able to review their data, so they love the Signal Data Explorer tool
that I showed. At the moment, it's not clear whether the searching facilities are as useful in that
environment as they are in the aero engine environment. Because searching for spikes with particular
characteristics is something some neuroscientists do, but many scientists are just happy to accept that
any voltage below a certain level is a spike. They don't care about the particular shape of the spike.
You can easily threshold all the data to find the spikes.
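That simple thresholding approach is easy to sketch: flag each point where the voltage first crosses below the cutoff, ignoring the spike's shape entirely. The function below is a hypothetical illustration, not the project's actual spike detector.

```python
def detect_spikes(voltages, threshold):
    """Naive spike detection: index of each downward threshold crossing.

    A 'spike' here is simply the start of any excursion below `threshold`;
    no attempt is made to characterize the spike's waveform.
    """
    spikes = []
    below = False
    for i, v in enumerate(voltages):
        if v < threshold and not below:
            spikes.append(i)      # start of a new excursion below threshold
        below = v < threshold
    return spikes

# A trace with two dips below -0.5 yields two detected spikes.
print(detect_spikes([0.0, -1.0, -2.0, 0.1, -3.0], -0.5))  # [1, 4]
```

Spike *sorting* -- attributing each detected spike to a particular neuron by its shape -- is the much harder step, which is exactly why some neuroscientists care about waveform characteristics and others settle for a threshold.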
>>: The reason I asked is, George Tracoski (phonetic) at Georgia Tech is super interested in doing
analysis over terabytes of astronomical data, and being able to recognize things you might not even
have a heuristic for how to recognize. We know a supernova looks like this kind of curve in this
particular area of the sky; I can imagine trying to do the same sort of thing here.
He wants to be able to grow the way you can even ask questions, so you don't have to say "show me
the supernova" -- show me the things that match a supernova, or anything that might be interesting.
Being able to search across terabytes of data for a single experiment seems like a puzzle; it's not going
to be like "just find me all the spikes -- now here are the 40 million of them we have for you."
>> Paul Watson: So I suggested that they get in touch with Jim's group, because they have used this
sort of technology in many different environments. They do facial recognition with it, for example; they
don't just do this sort of spike detection or stress detection.
>>: You're talking about search technology based on neural networks, or what?
>> Paul Watson: Yeah. So they've done a lot of work on neural networks. The correlation there, as I
mentioned, comes from one of their neural network techniques. They've also got things like accelerator
hardware to make this faster if the software isn't quick enough for the amount of data you've got.
They've got a lot of expertise.
>>: They have a commercial staff in the company which I think is --
>> Paul Watson: Cybula. C-y-b-u-l-a. The guy's name is Jim Austin, A-u-s-t-i-n, if you want to contact
him at York University. He's helpful. So you would be interested.
>> Savas Parastatidis: Any more questions?
>>: You mentioned you were using some biological ontologies in your e-science services. Are you
using those to see if data is from commensurate experiments? Or can people upload their own
ontologies or extend existing ones? Is there anything?
>> Paul Watson: So amongst the neuroscientists we work with, there's no tradition of using ontologies,
and very little tradition of having any sort of agreed vocabulary to describe metadata. So if I go back to
just one of these slides to show the problem -- yeah, so this, if I can show this. So this was the only
metadata. So we produced this to mimic some data that was made available to us. And the only
metadata that was available when it was produced was, as you can see, the title of it at the top.
And that was it. So you had to know what spike 0003 was to know what experiment it came from. So
what we've got is people who have been involved in bioinformatics and ontologies, and they've looked at
neuroscience, and what they've tried to do is bring over some ontologies for experimental protocol and
fill in the gap, which is the electrophysiological data, for which there wasn't any ontology at all. Now
we're in the process of doing user trials to see what the users think of this and, of course, it's quite
difficult.
So on the one hand -- I should draw this as a graph -- the more information you get them to produce,
then potentially the more shareable, the more interpretable, their data is. But as you go in that direction,
the less likely it is that people will actually bother to add it. So we're trying to find where that sort of
minimum lies, where people are prepared to provide the data but it's actually useful for us to use.
So I expect it's going to be a human problem as much as a technological problem.
>> Savas Parastatidis: Thank you very much. Let's thank Paul again.
(Applause)