>>: Let's get started, please, with our first speaker, German Molto from the University of Valencia.
>> German Molto: Management and contextualization of scientific virtual appliances. Here is the outline of the talk. I will start with a brief introduction and overview of our research group. Then I'm going to focus specifically on scientific cloud computing. In particular, I'm going to talk about contextualizing virtual machines and how to manage them using repositories and catalogs. Then I will talk about the scientific applications to which we are applying these kinds of techniques. Finally, I will end my talk with some conclusions and future challenges. Okay. So our research group focuses on applying different computational techniques, such as parallel, distributed, grid and cloud computing technologies, to different scientific fields such as engineering simulation, photonics, proteomics, and biomedical computation. We typically establish collaborations with research groups so that we can apply these kinds of techniques. If we have learned something through all these years, it is that scientific applications typically require large computational power and also need to process a large amount of data. So this is why we typically combine several techniques, for example high performance computing and grid computing, in order to solve large dimension problems. And specifically talking about grid computing, this has been a technique that has been successfully employed in many areas. And perhaps one of the most important contributions of grid computing is that it has leveraged scientific collaboration in the shape of virtual organizations and has allowed access to a large pool of computing power. But it also has some drawbacks. If I had to mention just one, it is that with grid computing it is the resource providers that define the execution environments. And this is one of the key areas that virtualization and cloud computing try to address.
Because with cloud computing it is resource consumers, not resource providers, that define the execution environments, in the shape of virtual machines. And having a controlled environment is especially important for scientific applications. But cloud computing also has other advantages: it allows dynamic scaling of infrastructures for resource providers, a user can have fast and easy access to a large amount of resources, and virtualization, which leads to server consolidation, also leads to reduced energy consumption. Now we focus on the point of view of scientists and engineers. They don't want to bother with technology; they just want to run their applications as fast as possible and to solve large dimension problems. And when we talk about grid technologies, we have lots of concepts and tools that complicate life for users and developers. And when we talk about cloud computing, we have additional problems: which hypervisor am I going to use, how am I going to configure, deploy and monitor my application, which APIs am I going to use? So I think it's time to focus on abstracting the details of porting applications to the cloud. And in particular in this talk I'm going to focus on the lowest level of this typical classification of cloud systems, where at the top level we have software as a service and the user applications. In the middle we have platform as a service, where application developers are provided with APIs to develop cloud applications. And at the lowest level we have virtual machine managers, which provide the enactment and management of virtual machines. And we can currently find lots of virtual machine managers in the literature. But we are considering these key factors in order to decide which ones we want to focus on. We want them to be open source, and we want them to have access to public clouds such as Amazon EC2. A wide variety of APIs, very important for designing applications.
Support for different hypervisor technologies. Contextualization support. Network management. And also the ecosystem: a wide variety of users, a large community. And having these key factors in mind, we focus on OpenNebula and Eucalyptus, and we also currently keep an eye on what the Nimbus project is doing. Well, virtual machine managers focus on supporting the lifecycle of virtual machines. But for scientific cloud computing we also require automated contextualization of virtual machines in order to get scientific virtual appliances. And we also need to reuse virtual machines among different experiments and between different researchers. So this is why we focus on application contextualization and also on the management of repositories and catalogs. Okay. Virtual appliance is a very well known term. It's just an encapsulation of an application, the application requirements and an operating system, as the minimum execution unit in a cloud. But if we're talking about scientific virtual appliances, then things get a little bit more complicated, because these kinds of applications might require a certain operating system; certain services from the operating system; program transferred services. A persistence layer, either a database or files, super files. Special middleware, for example where applications might require the [inaudible]. Computational libraries: you might have scientific applications that require numerical kernels such as [inaudible] pack. And finally you have the application and the application data. So creating a scientific virtual appliance is not a trivial task. Going from a virtual machine where you have the plain OS to a scientific virtual appliance where you finally have the scientific application running typically requires a process called contextualization. And contextualization means creating the appropriate hardware and software environment for a successful execution of an application. And this typically happens at two levels.
Because first of all, virtual machines have to be contextualized. When the virtual machine boots, for example, it needs outbound connectivity, and support for this contextualization process is typically provided by the virtual machine managers. But then you have to contextualize your applications, so that the appropriate environment is defined. So the applications need to be deployed, configured, built and executed. And we can currently find lots of tools out there for machine configuration, which allow dealing with ordinary machine configuration and also the installation of commonly used packages, very comprehensive tools. Many of them are in the shape of client-server tools. But we wanted to focus on the specific workflow for deploying scientific applications. And things typically go like this. First of all, you have to resolve dependencies. Applications might require other related packages or system packages, so the dependencies have to be installed first. Then you have to configure the applications, which means a set of actions: copying files, changing properties, declaring environment variables. Then you have to build the application using different build systems. And you finally have to start the application, which, depending on the application, might require invoking a script, starting an application, parallel execution or whatever. So we are currently working on software for application contextualization. And the idea is to inoculate scripts and configuration into the virtual machine with minimum user intervention. So we currently have this approach. We have the user that has the application, and the user or developer writes the application deployment description in a high-level XML language, no programming skills needed.
And together with the software dependencies, this goes to the contextualizer tool, which creates a contextualization plan so that in the virtual machine the packages are installed, configured and built, and the application is run, so that we can perform the deployment of the application at boot time in the virtual machine. So we currently have a proof-of-concept tool for this, coded in Python to ensure good portability. We have a plugin-based mechanism so that users can write just XML deployment descriptions. No need for programming skills. And we currently stage the tool, the application and the application requirements into the virtual machine at boot time, creating a special disk image, so that when the virtual machine boots it can start the contextualization process, and we can go from a virtual machine to a virtual appliance ready to execute the application. Now, cataloging is an important feature we want in order to enhance collaboration and virtual machine sharing. And it is true that there exist VM catalogs out there, but they focus specifically on human consumption. They don't provide APIs or structured metadata. And we wanted to work on a catalog that includes virtual machine metadata, a description of the operating system and the software environment. Very important to execute applications. And we use the Open Virtualization Format, which is XML based, to describe the features of the virtual machine. We also want to provide links to other repositories, which are either local or remote. And we are currently working on matchmaking algorithms to retrieve the most appropriate virtual machines according to application requirements. This is one area in which we are working right now. So we currently have this approach. We have the user, who provides the OVF description of the application. It talks to the catalog to register a virtual machine.
And the catalog creates an instance of a transfer manager, which creates temporary credentials, which are delegated to the user in order to upload the disk image files to the repository, and these image files are registered in the catalog and conveniently tagged so they can later be accessed. Now, concerning the repository, it includes the storage of VMs and provides data access mechanisms, which currently include HTTP and FTP, very well known protocols. But we're also considering including GridFTP, which would provide enhanced certificate-based security. It's a protocol for transferring files. And the current virtual machines that we are considering are golden virtual machines, currently based on Ubuntu JeOS, which provides a full operating system in just 380 megabytes. It's very important. A very low footprint. And also having pre-contextualized virtual machines. For example, for grid applications we have virtual machines where we already have Globus Toolkit 4 deployed, so we can contextualize just the Grid Services that need to be deployed. Well, so this is the big picture that we currently have. We have the user with the application requirements. And these application requirements are submitted to this cloud enactor component. This cloud enactor component talks to the VM catalog in order to retrieve the most appropriate virtual machines, which are retrieved from the repository. And then it talks to the contextualization software, which must compute the deviation between the virtual machine and the application requirements so that it can create the contextualization plan, which is submitted to the virtual machine manager. And the virtual machine manager starts the virtual machine. And when the virtual machine boots, it starts the contextualization process. So that you finally have your scientific application running on this virtual infrastructure.
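The matchmaking and plan-creation steps described above can be illustrated with a minimal Python sketch. This is not the speaker's tool: the class names, the package names and the "fewest missing packages" selection rule are assumptions made for illustration, and the real catalog works from OVF metadata rather than plain sets.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """Hypothetical catalog entry: an image and the software it already contains."""
    name: str
    os: str
    installed: set = field(default_factory=set)


@dataclass
class AppRequirements:
    """Hypothetical application requirements: target OS and needed packages."""
    os: str
    packages: set = field(default_factory=set)


def deviation(vm, req):
    """Packages the application needs that the VM does not yet provide."""
    return req.packages - vm.installed


def select_vm(catalog, req):
    """Assumed rule: among VMs with a matching OS, pick the one that
    needs the least contextualization (fewest missing packages)."""
    candidates = [vm for vm in catalog if vm.os == req.os]
    if not candidates:
        return None
    return min(candidates, key=lambda vm: len(deviation(vm, req)))


def contextualization_plan(vm, req, app):
    """Ordered steps following the workflow in the talk:
    resolve dependencies, then configure, build and start the application."""
    steps = [("install", pkg) for pkg in sorted(deviation(vm, req))]
    steps += [("configure", app), ("build", app), ("start", app)]
    return steps


# Illustrative data only: a plain golden image and a pre-contextualized one.
catalog = [
    VirtualMachine("golden-jeos", "ubuntu"),
    VirtualMachine("grid-ready", "ubuntu", {"globus-toolkit-4"}),
]
req = AppRequirements("ubuntu", {"globus-toolkit-4", "numerical-kernel"})
vm = select_vm(catalog, req)
plan = contextualization_plan(vm, req, "cardiac-simulation")
```

With these inputs the pre-contextualized image is preferred, because only one package is missing from it, and the plan installs that package before configuring, building and starting the application.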
Well, there's a missing point here in this figure, and the point is how I'm going to control the application and access the output files inside the virtual appliance. And we currently rely on the Opal 2 Toolkit, which is a tool developed at NBCR, and it provides a Web Service wrapper for applications. It was initially developed to provide a wrapper for legacy applications, but it is very, very useful in this kind of environment because it provides operations for remotely starting, monitoring and terminating the application. And, very important, it also allows access to the output files while the files are being produced, because they are exposed as a Tomcat service. And this enables us to introduce computational steering. I mean, to see how your simulations or your executions are being processed while the output data are being written to files. This is a very important approach. So with this Web Service wrapper, we have the hardware, the hypervisor, and then the virtual appliance, which runs the scientific application and a Web Service wrapper that allows this cloud enactor component to control these applications, to start them, to monitor them and to access the output files. This is currently the approach that we have. Now, what kind of applications are we using with these kinds of techniques? We are currently considering three different applications: simulation of cardiac electrical activity; simulation of guided light in photonic crystal fibers; and also optimization of protein design with target properties. These are applications that we have been working on in collaboration with other research groups for the last five years. And they are applications to which we have previously applied high performance computing and grid computing, and we are currently investigating how these applications can benefit from these cloud infrastructures. Okay. So just to wrap up, I would like to give some conclusions.
Scientific cloud computing requires tools to abstract the interaction with cloud infrastructures. So going from applications to scientific virtual appliances is a key point. We are working on application contextualization and virtual appliance management. And we think the cloud looks like an alternative approach for executing scientific applications. And the main benefit over a grid infrastructure is that it allows users to define the specific execution environments. This is one of the crucial points. We also see some challenges in the near future. One of the most important is that, for example, infrastructure providers are currently different silos, so probably software gateways should be developed to aggregate these kinds of infrastructures. And we also see a large ecosystem of virtual machine managers. Many of them share the same functionalities and goals. So we'll probably see in the near future the rise and fall of some of them. And it is true that there exist common APIs for cloud computing, so that you can develop your application against one single API and then access different cloud infrastructures. But we'll have to see how this plays out over time. And a final thought is that clouds and grids will both provide computational support to scientific applications. Okay. And this is what I wanted to tell you. If you have any questions, I will be glad to answer them. Thank you very much. [applause].
>>: Questions for our speaker? Bill.
>>: So you mentioned this cause of the [inaudible] VMs. [inaudible] people sort of develop [inaudible] get it running and save it all as [inaudible] so the automation step that you are offering, where someone expresses application requirements at a high level and automatically [inaudible] VM, I guess I'm not quite clear why that automation needs to exist. I mean, at some point somebody has to manually kind of construct the application once [inaudible].
>> German Molto: I mean, the first time you have to manually configure the virtual machine in order to install the application, but you can replace this manual installation process with an automated process. And this automated process is [inaudible].
>>: You've already done the manual process once [inaudible].
>> German Molto: Yes. This is true. But this is just one single infrastructure. For example, if you are using Amazon EC2, you just create a specific virtual machine for this infrastructure. But imagine that you might access different infrastructures simultaneously. For example, Amazon EC2 uses the Xen hypervisor; then you can use another infrastructure that uses a different hypervisor, so you might find yourself in a situation where you have to deploy the --
>>: [inaudible].
>> German Molto: But once you have configured the application, you can save it as a pre-contextualized virtual machine so that it can later be reused.
>>: Any other questions?
>>: I have one more.
>> German Molto: Okay. [laughter].
>>: [inaudible] unit in the sort of dataflow diagram you showed, at the first step the user expresses application requirements. How are they expressed?
>> German Molto: Currently we're working with OVF, which is an XML document, but it is very expressive. I mean, we're currently thinking about how to transition to another, probably more appropriate format, without being so expressive, because many times the requirements of applications don't need to be so expressive. So we currently have a proof of concept using OVF, but we will probably transition to another approach.
>>: Let's thank the speaker. [applause].
>> Gregor Srdic: Hi and welcome.
Well, I see that my presentation is very different from all the previous ones, as I'm going to present to you our project, an application that was built to demonstrate a proof of concept and technology, and to use the tools that were available to us. I come from the University of Maribor in Slovenia, and I'm part of the Cloud Computing Centre there, which was established in cooperation with leading business partners in our field such as Microsoft, IBM and Oracle. And the aim of this Cloud Computing Centre, which is actually the first in our country and also the first in our region, is to transfer technology and knowledge between the academic sphere and business. Besides research, my colleagues and I also assist in undergraduate study programs, and we try to additionally motivate students by including the newest concepts of computer science in our classes. So as a practical part of one of our courses, we decided to give students an assignment to write down a few ideas about practical uses of cloud computing, and then we collected those ideas and selected a few of the best ones. We built a high level architecture and divided the students into several groups. Each group then built a partial solution individually, and at the end of the semester we integrated these solutions into final applications. This project, called SkyInfo, was one of those applications, and it showed enough potential that we decided to carry on developing it even after the end of the course. The problem that this project addresses is, I'm sure, well known to everybody: in today's information society every individual is overwhelmed with a large amount of data. It is surrounding us, and most of this data is unimportant and only makes it harder for us to get to the information we really want. Also, a lot of time is wasted processing this unnecessary data. And during this time, a lot of the information that we were looking for can already become obsolete.
Well, to overcome these problems, we decided to filter all available data in relation to the individual's current location and expressed interest. Therefore SkyInfo is designed to offer an intelligent service, available anytime and anywhere, which will provide users with relevant, localized and personalized information. The main idea is that on one side users submit messages about events they witnessed, either by writing a text message or by submitting a picture, and on the other side they receive events from other users based on their current position and subscriptions. Each subscription defines a category of messages to be retrieved and also a maximum distance and time validity of a message. Of course, considering this solution, we've also encountered a few problems. The first problem was that multiple people can witness the same event and send a message, so we had to develop an algorithm to identify those duplicates and treat them properly. Another problem is that after a user sends a few false messages, his integrity will certainly be questioned. Therefore, we designed a rating system that works on feedback. So every user who receives a certain message can respond to it, either confirming it or denying it. And based on this feedback, the messaging user's rating is updated. We have considered a lot of use cases for our application. These include reporting traffic accidents, traffic jams and other traffic information, warning about approaching weather conditions, and reporting lost and found objects. Another interesting idea is localized advertising, and so on. Here we will take a closer look at one of the scenarios. Imagine Mary, who is walking in a park on a sunny afternoon. She finds a sad puppy who obviously got lost. She takes a picture and submits it to our system.
And luckily the puppy's owner is already looking for the puppy nearby, and he also uses our system, and he instantly receives a message with the position, and thanks to our system the puppy and his master are quickly joined together and happy once again. Here we have a presentation of the high level architecture, where we decomposed the application into several modules. With this, we have enabled loose coupling of components, and later I will mention that we decided to implement each of these components using Web Service technology. The main part of the SkyInfo application is distributing messages. Therefore, we've tried to develop an innovative approach to exchanging data about current events. So besides considering location and timeframe, we've also tried to simplify the process of submitting messages. To submit a message, the user only has to fill out a description and select a category, and then our system, or the client for that matter, automatically gathers the location, either from internal or external GPS or from other available methods. And then these messages are distributed back to clients, which continuously report their location to our system and retrieve new messages. There's also another way, for urgent delivery of messages: at the request of the user, messages can be delivered directly and instantly over the short messaging service. The way of sending text messages which I just presented seems to me to be as simple as it could possibly be. But we have gone even further and we've developed another innovative approach to exchanging data about events.
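The subscription filter described above (a category, a maximum distance and a time validity per subscription) can be sketched in a few lines of Python. This is only an illustration under assumptions: the class and field names are hypothetical, and great-circle distance via the haversine formula is one reasonable choice, not necessarily what SkyInfo uses.

```python
import math
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical event message with the location gathered by the client."""
    category: str
    lat: float
    lon: float
    timestamp: float  # seconds since the epoch


@dataclass
class Subscription:
    """Hypothetical subscription: category, maximum distance, time validity."""
    category: str
    max_distance_km: float
    max_age_s: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def matches(msg, sub, user_lat, user_lon, now):
    """True if the message fits the subscription for a user at the given position."""
    if msg.category != sub.category:
        return False
    if now - msg.timestamp > sub.max_age_s:
        return False  # message is past its time validity
    return haversine_km(user_lat, user_lon, msg.lat, msg.lon) <= sub.max_distance_km


# Illustrative data: a lost-and-found subscription with a 2 km radius and 1 hour validity.
sub = Subscription("lost-and-found", max_distance_km=2.0, max_age_s=3600.0)
nearby = Message("lost-and-found", 46.5600, 15.6500, timestamp=900.0)
far_away = Message("lost-and-found", 46.0569, 14.5058, timestamp=900.0)
user_lat, user_lon, now = 46.5547, 15.6459, 1000.0
```

Here `matches(nearby, sub, user_lat, user_lon, now)` holds, while the same call with `far_away` does not, since that message lies well outside the 2 km radius.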
Here users can submit only a picture, a plain picture, and that can be submitted either over a Web Service, or by e-mail, or by multimedia messaging, and thanks to the use of Amazon's Mechanical Turk service we can classify this picture into one of the available categories. I forgot to mention that the picture has to include geolocation, but many devices nowadays already support that, so this is not a problem. Here I have an example of a human task that is built by our system. Mechanical Turk provides a programming interface for building human tasks, which are available online, and every end user on the Web can work on them. So I think it's a very interesting alternative to computing methods and algorithms, which are very complex, at least for image recognition. Well, our project is still a work in progress, but so far we've built a first version of the core system and two clients. On this picture is a mobile client. The first one is built on the Android platform. And we intend to build another one for Windows Mobile. We had some problems calling Web Services from version 6.5, so we decided to wait for version 7. Here on the left is a presentation of new messages, and on the right is a form for submitting new messages. This is a Web application which, besides displaying messages, enables users to subscribe to categories. It's built in Microsoft Silverlight technology. Another challenge that we faced building this project was integrating many technologies. For this purpose we've employed Web Services, which gave us a kind of bridge between different platforms. The core services of SkyInfo are built in Java and run on an IBM WebSphere application server. The supporting services are built in Microsoft .NET and run on an IIS server. And we have already moved the first of our services to Windows Azure. And the ones that are built in .NET could all easily be transferred there as well.
For human tasks, as I already mentioned, we used Amazon Mechanical Turk. And for the Web interface we've used Microsoft Silverlight. Then for mobile clients we have used Android and Windows Mobile. And of course a big part of our service is using maps. We've used Bing and Google Maps. Well, as for the key contributions of our project, I believe that linking relevant information and events with location and user interest leads to the quickest access to the desired information. And I think that our approach to exchanging data about events by submitting only pictures is very innovative. So to wrap up, I would like to say again what our project is all about. Well, it's basically a framework for sharing information in a structured and personalized manner. We would like it to be useful for the general public. SkyInfo could also be distributed as software as a service, we believe, to organizations that require a private information service. That's it. [applause].
>>: Questions for our speaker?
>>: I have one. So you use Mechanical Turk for classification in the cloud, right, but you also have the users, when they submit events into your cloud based app, classify them as well, right?
>> Gregor Srdic: Yes. There are two different ways. You can submit a message. There you have to select a category. But if you submit only a picture, the picture is classified using Mechanical Turk.
>>: Okay. And how is your turn-around time on Mechanical Turk classification?
>> Gregor Srdic: Yeah, we didn't test that in reality, we have only tested it in the sandbox until now. So we did the human tasks ourselves. It was quick.
>>: I have a question. I live in the [inaudible] Brazil and so [inaudible] terrible traffic jams. And I would like to present to you a scenario and see how SkyInfo would respond to that. For instance, you have a traffic jam and hundreds of users start using their cell phones, sending pictures of the cars stopped in front of them. And these images are geotagged.
But I'm there on Snoopy Avenue sending pictures, and am I on the northbound or the southbound side? Can you decide that based on the geotag? And if you're receiving hundreds of images of stopped cars, could you infer the reason for the traffic jam, or would it be useful only if a few cars were close enough to the reason why the traffic stopped?
>> Gregor Srdic: It's an interesting point. I'm sure that it could be useful to know that there is an accident. With what we've done so far, it's impossible to tell on which side of the road it is happening. But you get some basic information. Maybe users can write it in the description. That's another way.
>>: Good point. I realize that you didn't mention what you do with the content of the SMS text messages you receive. Do you use human users to process the content?
>> Gregor Srdic: No. We just distribute it to the users that are receiving messages. We made some algorithms for identifying duplicate messages, and these do read the content and try to find some patterns.
>>: How do you distinguish [inaudible] system in the cloud where people can actually subscribe interest in certain events and [inaudible] so you're actually generating these messages from the devices out in the field and routing them up to the cloud, and routing them to subscribers or people who are interested in them. So my question is, that's a very similar pattern to pub-sub systems, publish-subscribe, where people subscribe to [inaudible] people publishing messages. I'm curious if you're taking a different approach, if your platform is different from a pub-sub system in terms of your processing or scalability, or have you compared it against pub-sub systems?
>> Gregor Srdic: I'm not familiar with them.
>>: Okay. [inaudible]. Have you done any scalability testing in terms of how it scales in terms of event loads and --
>> Gregor Srdic: Well, we actually are running this in a private cloud so -- but we haven't done extensive testing.
>>: Any additional questions?
Let's thank our speaker one more time. [applause].
>> Marco Parenzan: Thanks to everyone. What I want to present is also, for us, a proof of concept. I am a computer science engineer and I work for a chemical engineering laboratory that is called MOSE. It works on a technology that will revolutionize the world of research, which is called multiscale molecular modeling. This approach is applied in three main fields: material science, life science, and process simulation. What is multiscale molecular modeling? Multiscale molecular modeling is an approach that allows us, in the end, to simulate a process at the engineering level, beginning from a quantum mechanics level of simulation. But this is multiscale because there are no computational resources available, now or in the near, medium or far future, to simulate this in real time. So multiscale means that we have many stages that you can handle one at a time, but the [inaudible] is to simulate the process from the quantum mechanics level up to the engineering level. How do we solve the problem of having many steps? This modeling is message-passing multiscale. In general we have many software packages that simulate the same process, each modeling it at the proper level. Each application has an input, runs its simulation, and then the output is passed to the next level. This is mainly a sequential process, but the sequence of activities can obviously be modeled in a cloud. The sequence of activities can be distributed worldwide, because the laboratory collaborates with many other figures, laboratories and companies, so the laboratory is able to simulate in house but can also distribute the work around the world. At the same time, the customers are worldwide. But at this moment the laboratory is not on the cloud, because we are based on software that is not on the cloud.
The simulation software is not on the cloud. So in house we have a pile of servers on which we run simulations. But specifically, this proof of concept is for us to talk with the people that give us software, or internally, about how we can do computational work on the cloud. Also, we need the cloud because the cloud is a useful collaboration platform, so it can allow us to distribute the work around the world. Can MOSE access the cloud alone, given its focus? Mechanical engineers and chemical engineers are the typical users. You have to understand that I am the only computer engineer inside the laboratory. So often they also need to write some code for simulation, but also some code that can parameterize some of this software. The idea is to help these people write some code, code that can execute on the cloud. So the objective of this research is to move the laboratory to the cloud, and we cannot wait for the software companies to do this. So we tried to do something to make this work. As I told you before, we, as computer engineers, can write this code, but there is always a big problem, which is the main problem: what does a molecule mean, what does an engineering process mean, for a computer science engineer? So the idea is: why don't we enable non-computer scientists to write their own code, by simplifying the writing of their own code? One aspect we have to understand at this moment is what a non-computer science engineer can do. What does simulation software, for example, ask of these people at the moment? It's normal today for this kind of software to be accessible only in C++ code, which is quite difficult. So, for example, one activity we have done with another group is moving the code to the CLR, writing in C# or VB.NET instead of VB 6 or, worse, C++. The next steps are the usage of dynamic languages like Python or Ruby, and then also the usage of domain specific languages, DSLs.
What can abstract on another level the way we can express code and data? Just a few slides. I'm not Microsoft but I like to share with you technologies that are from Microsoft they are from IronPython is about three, four years but is not so everywhere. There were a lot of -- few people. Dynamic languages are on the sector, and someone that comes from the open source community is now part of Microsoft is called Jim Hugunin and has developed the first version of IronPython. And the nice thing about work is that IronPython was split in the form with version 2 in two parts. One that is a specific part that is IronPython language. The other thing is the dynamic language runtime. That is the common part for dynamic language is in fact we are waiting in these days probably for version 1.0 of IronRuby. That is the other main language in the dynamic community. One other big thing is that these languages run natively in .NET, so they are completely CLR types. So completely managed code. They run under Azure as we will see in few minutes. They are quite advanced, for example, buy IronPython is already unicode code and various problems with the original Python code that is not yet unicode. And the nice thing is that dynamic language runtime allow other code, for example JavaScript that is coming from community we are having a native JavaScript language running on the runtime. And if you don't know the DLR is also under C#4.0 and on .NET 4.0 that is in the four days we will have a final version because the dynamic keyword in C# is in effect a part of that DLR. In fact, Jim Hugunin has moved to the C# team. This is one technology what we are using to allow no computer people to simplify their programming abilities. Another tool that we are starting using is Oslo. Oslo is already named as SQL server modeling because this -- we can say research project but it was not a research project, has -- is becoming a product that will be out in long time because it's one, two years. 
What is -- what is Oslo? Oslo is a part of the data platform from Microsoft, and has these components: the M language, the Quadrant tool and the repository. What is Oslo? Oslo is a framework of tools that allows us to create domain specific languages. So we have a tool with which we create languages in a simple way. Mainly textual languages. While Quadrant is a visual tool to interact with this -- with a domain language. And this is useful also because if we use storage under Windows Azure, not SQL Azure but Azure storage, blobs, tables, or queues, but mainly, for example, blobs, Oslo, what is Oslo? Oslo is a tool that allows us to make a parser for unstructured data. In fact yesterday when Ed Lazowska presented -- told us that the scientific world lives in a big amount of unstructured data, well, Oslo is a nice tool to structure the data. In particular, Oslo is a schema -- is a schema language to give a schema to any data we want. In fact, this is the way -- this is why Oslo got into SQL Server. It went into the SQL Server platform, or mainly the data platform. So which are the simplification steps? What is our proof of concept? We have a Windows Azure application, specifically a Web application, a Web Role in an Azure application with MVC2, and a worker role for background processing. What we do with this Web application, we load Python scripts, for example, that are executed in a worker role. How do we interact with this dynamic language, with jobs that run on a worker role? We do input and output with textual messages that are parsed with Oslo. So we will see a demo to fix the correct messages. I have something just to say -- just to see what Oslo is. A file that does not relate to the example, but to see what structured data are. I have downloaded a comma separated file from a business website on which you have money exchange. This tool is called [inaudible]. In the center you have the grammar, which is an evolution of a typical [inaudible] grammar.
This tool has the nice thing that you write the grammar. On the left you have the source and on the right you see the M graph that is -- in this case is the serialization of the abstract syntax tree that is generated by this tool. So this is to show you that -- how to structure the unstructured data. Specifically for this example. Okay. My example is quite simple. It is matrix multiplication. But suppose you had an input made in this way, not in a coding way, a way that is probably similar to MATLAB, for example. Okay. You have two variables E and M, a vector and a matrix. You see that there is a grammar and then you have the result of the parsing. That is accessible to a programming language to [inaudible] data. This is the input. The output I find, okay the output is -- the part of the output is a template in which the results are written to a text file. So in this way I have written a grammar with which I parse a template, and you see text, expand, text, expand. If you write you make [inaudible] write, read: text is just written as text, expand is read the value, write the value. Okay. With Oslo it's quite simple to write this kind of thing. So if we see the application, I have done a simple application, quite -- I would like to call it the Facebook of code but probably it's a -- oh, okay. Here we have, for example, as a publisher I can write some Python code, and you see how simple it can be. You don't have an environment, you don't have Visual Studio, you have just a Web page in which you write the code, just the code you need. Okay. Simple matrix multiplication in the Python language. And sorry that the zoom is not good, but for example here you see dictionary of math, that is the schema, that is the parsing of an input. I need a dictionary and I have an output that is M report, that is the template we were seeing, which are defined in an administrator part. I have a map of dictionary and [inaudible].
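The input parsing and templated output described in the demo can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the exact input syntax from the talk is not shown, so the bracketed MATLAB-like format, the names `parse_input` and `render_report`, and the report template are all hypothetical, and a regular expression stands in for an Oslo grammar.

```python
import re
from string import Template

# Hypothetical MATLAB-like input; the real syntax used in the demo is not shown.
SAMPLE_INPUT = """
V = [1 2 3 4]
M = [1 0; 0 1]
"""

# One assignment per line: a name, an equals sign, and a bracketed body.
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*\[(.*?)\]\s*$")


def parse_input(text):
    """Return a dict mapping each variable name to a list of rows of floats."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = ASSIGNMENT.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, body = match.groups()
        # Semicolons separate rows, whitespace separates values within a row.
        values[name] = [[float(tok) for tok in row.split()] for row in body.split(";")]
    return values


# Hypothetical output template: fixed text plus named placeholders to expand.
REPORT = Template("Result for $name:\n$rows\n")


def render_report(name, rows):
    """Fill the template with a variable name and its rows, one row per line."""
    body = "\n".join(" ".join(f"{x:g}" for x in row) for row in rows)
    return REPORT.substitute(name=name, rows=body)
```

Calling `render_report("V", parse_input(SAMPLE_INPUT)["V"])` gives a small text report, which mirrors the text and expand parts of the template mentioned in the talk: fixed text is copied, placeholders are replaced by values.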
So you see that this Web application can organize the code, the project, for a person that has no skill, for example, with Visual Studio. How can you interact with this application? You have to send messages. So for example you have a request in which you define: I want a batch processing for this component when I give input. Simple. Okay. Simple message. Matrix multiplication, that makes the double of the -- square, sorry, of the values. Save. So in this moment when I save to the table, I have already sent a message that is written on the queue. So this triggers the worker role, that reads the message, finds in which language the code is written, and then you see on the refresh that a message is posted. Okay. I have made a mistake because I needed a vertical vector, sorry. So I have to put the comma -- okay. Obviously this is a Web interface that we think to replace with this -- with a WCF Web role in which we have a REST application in which all the information is serialized and then parsed. Okay. You see it has 1, 4, 9, 16 that is [inaudible] duplication. This, this is the demo. So back to my presentation. Just to say that the method is too simple important a code. This is the result of one of the simulations we do with our software. But we think that this is just a little more complicated matrix multiplication. With a matrix that is quite large, so in this way the cloud can be useful because we can divide that information on multiple worker roles, multiple instances of a worker role that can elaborate that multiplication. So the result -- well, the proof of concept is that all these pieces work together. This is working under, even if I have not shown you, the Azure development environment. So this is code that can be deployed to a real Azure account. The conclusion. Why does MOSE need the cloud? Because we need a platform that allows us to create messaging based applications.
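The message flow shown in the demo (a request is written to a queue, a worker reads it, finds the language of the code, runs it and posts a result) can be sketched with an in-memory queue. This is a minimal sketch under assumptions, not the Azure code from the talk: `queue.Queue` stands in for Azure queue storage, and the message fields, the `RUNNERS` registry and the convention that a script sets a variable called `result` are all hypothetical.

```python
import queue


def run_python(source, inputs):
    """Run a user script with `inputs` in scope; the script must set `result`."""
    scope = {"inputs": inputs}
    exec(source, scope)  # no sandboxing here; a real service would need isolation
    return scope["result"]


# Hypothetical registry: one runner per supported scripting language.
RUNNERS = {"python": run_python}


def worker_step(job_queue, result_queue):
    """Process one queued job message and post its result message."""
    job = job_queue.get()
    runner = RUNNERS[job["language"]]  # pick the runner from the declared language
    output = runner(job["source"], job["inputs"])
    result_queue.put({"job_id": job["job_id"], "output": output})


job_queue, result_queue = queue.Queue(), queue.Queue()

# The Web front end would enqueue something like this when the user saves a request.
job_queue.put({
    "job_id": 1,
    "language": "python",
    "source": "result = [x * x for x in inputs['V']]",
    "inputs": {"V": [1, 2, 3, 4]},
})

worker_step(job_queue, result_queue)
reply = result_queue.get()
```

With the input vector 1, 2, 3, 4 the reply carries the squares 1, 4, 9, 16, the same values shown in the demo. Keeping the language name in the message is what would let several worker instances, or several script languages, share one queue.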
In the demo we have seen the creation and the execution of a simple step in the process. And the input and output that are more near to the scientific world that the information world. You have not seen the XML that is -- was quite typical to write parsable text but was quite awful to be used from scientific people. The code we have libraries that define these components but invokes for example IronPython and invokes Oslo. What's next? The next is obviously continue with the project. One idea is the definition of a project. And I have updated representation just to locate our work of presentation yesterday from Paul Watson where a person made the workflow of a component. Well, our work is exactly this proof of concept is exactly the single workflow step. And now we need also the entire platform. Collaboration in the process. So how do we share this input/output messages? So blogging tool can be useful, Twitter integrated with a platform can be also useful. But the next is the verticalization on the domain. So Python is useful but we can extract again how writing a DSL that is a real language I made -- here I cannot do the demo because I have written the code in -- with a [inaudible] that is .NET 4 why Windows Azure is 3.5. And this is why yesterday I asked when cloud will be 4.0. This is grammar for a language -- a custom language that has the same output. But the idea is to have, yes, we can have a standard tool that allow us to make the normal operation probably in a more verbose way to be readable, to be more understandable from scientific people. But it will be interesting, for example, having a mixer of Python syntax, having LINQ like syntax to interact with data sources but avoiding connection strings. And this is nice because living in a cloud in Windows Azure we can construct the need for connection string and say we are on a Azure account so give me the -- that particular blob or table or queue object. So this is it. 
This, as I told, is a work in progress. And I think it's also an invitation to use these tools because they are very, very useful. Thank you. [applause]. >>: Questions for the speaker? >>: I have one. So is your bottom line goal with all this to make -- to make it easy to define specific languages to pass standardized formats around multiple simulation codes offering each size scale? Or is it to rewrite simulation codes into something that runs in the cloud environment? >> Marco Parenzan: We can -- the first step is the idea that we can abstract as much as we can the programming environment for the chemical engineer. Then if he can write code, he can try to write the code he needs. So also the simulation code. There are many -- there are many -- for example, one example is the processing -- the process engineering in which you have mixers or splitters or reactors, which can be written in some code that can be expressed in just a few equations. So what we want is not to make the chemical engineer write a class -- a function, but just write the equation, declare which are the inputs, which are the outputs and then execute it. Azure, the -- these techniques can be applied outside the cloud obviously. The nice thing of a cloud is the storage, because a mechanical engineer in front of a computer does not know where to save the information. He only writes text files. So Oslo can define the schema. Azure can save these files in a readable way, because it's quite important. >>: So unless it's defined and saved how would a group of chemical engineers which is distributed along several loads, how could such a group cooperate combining reusing [inaudible]. >> Marco Parenzan: In this demo you have not seen a collaboration platform. In fact, I've written that very -- we need to define a process. >>: Okay. >> Marco Parenzan: This proof of concept works on expression. So can we abstract the programming skills for the chemical engineer, so simplify the work for the chemical engineer.
Can a chemical engineer write some code? This is the first question that this work answers. The next step is to make -- is to make the real platform that allows the collaboration. As I told, the work presented yesterday by Paul Watson showed that other people are working on the same topic. >>: Any more questions for the speaker? So I have one for you. First I was really surprised to see Oslo, to see Oslo in action [inaudible] I really like that. But can you talk about the [inaudible] of Azure some of the simulations, what the simulations [inaudible] one worker or class of simulations you're targeting right now? >> Marco Parenzan: This work is a ticket for us to go to the simulation software companies to say well, bring your software to the cloud. So in Trieste there is a company that does this kind of work. There is a project on this. And this can be a sort of another runtime. We could say that Python or a DSL can be another specific runtime, but it can be the runtime for this application, for this platform, for this software company to bring their software to the cloud. We know that we cannot write an entire simulation, but it's an invitation to these companies to bring their code to the cloud. >>: [inaudible]. >> Marco Parenzan: What I see is that the chemical engineers have many difficult [inaudible] so we have to abstract, and Oslo is, Oslo is not a revolution in terms of languages because we have probably more powerful tools, for example ANTLR, that is the component that can generate parsers for us. But Oslo is a typical Microsoft tool, it is a very simple tool to use. It's just that we can relate to Oslo like we relate to an XML DOM. It's the same. But because you know that Don Box -- Don Box's work is one where -- in fact yesterday I was -- yeah. I show you. I have -- sorry. I was with [inaudible] and with Don Box to ask what Oslo would perform.
But Don Box says: give a schema, a schema for text files, for generating text files. This is the objective. >>: Great. >>: Just real quick I want to reask one part of Roger's question, which is the sort of two -- at least two sizes of simulation, the ones that work on a single node and the ones that don't work on a single node. And one part of Roger's question was do you intend to support the ones that do not work [inaudible] requires orchestration of [inaudible] codes? And if so, how? >> Marco Parenzan: As I told before, we know that we cannot rewrite an entire simulator. >>: It's not a rewrite, I guess it's just a -- there's a scale problem I suppose that even if you were -- even if someone else comes in and writes -- hands you a CFD simulation, a computational fluid dynamics simulation, that necessarily runs in parallel because of the performance. >> Marco Parenzan: Okay. Okay. >>: Azure may not be the best fit for that for some of the reasons Roger mentioned? Do you intend to just ignore -- which is fine -- >> Marco Parenzan: No, no, no, no, I understand. The idea in this way with a DSL is to model with the DSL -- to simplify in the DSL the way we can express the distribution of a computation. So try to abstract -- our objective is always writing the vertical aspect of the domain. So knowing that there are difficulties in dividing the process in parallel, the idea of the DSL is to abstract also the parallel abstraction in the language, so to try obviously to make it accessible to skilled and non-skilled programmers. >>: Let's thank our speaker one last time. [applause]. >> Marco Parenzan: Thank you. >>: Our final speaker of the session is Domenico Talia from the University of Calabria: Towards an Open Service Framework for Cloud-based Knowledge Discovery. >> Domenico Talia: Thank you. So my talk is a discussion about the strategy for the implementation of a service -- of a service oriented framework for running knowledge discovery applications on cloud-based systems.
Actually the presentation is mainly divided in two parts. So the first part I will try to discuss the strong. So one approach for the completion of services oriented distributed knowledge discovery task and application on -- yeah. Okay. Today we must say cloud because we are in a cloud-oriented conference. But I think that's -- we can speak about large scale distributed system or large scale high performance computing system. So we understand what we mean. So the idea is to try to outline approach for implementation of large scale service oriented knowledge discovery tasks as services on this kind of platform and investigate how this kind of knowledge and discovery data mining services can be used to implement distributed analysis application according to the service oriented architecture model. So in the first part I present the approach and the second part of the talk I will give you some references to real projects we are running where we develop some software according to this approach that show the feasibility of the approach and the way in which this services can be used for real implementation of distributed knowledge discovery application. So, you know, we have to deal with complex problem with big, bigger and bigger problems today. And most of them came from the -- okay. Obviously we -- it's life. But the main issue here is that we have to deal with very huge data source. So data source today are larger and larger and often they are distributed, okay? So starting from this scenario we have to deal with this huge amount of data, and we have to use this data. Okay? And obviously we have a main problem -- we have one problem which is storing data, okay? And it's a problem. But it -- from my point of view, the main problem is not storing data, the main problem is to analyze data, to mine data, to process data trying to understand it. So you know, obviously one problem could be or is having no data, having no information. 
But another, the other side of the coin is that having so much data that you cannot understand nothing. You cannot manage them by railroad tools or they cannot read by humans. So having too much data is more or less like not having data, okay? Now, just to mention some estimate, I'm not sure how accurate is this estimate but it's an estimate, okay? It seems that in -- oh, sorry. It's not 2006 but is 2009 the 9 change its position. In 2009 it seems that we produced 750 billion gigabytes. And the problem is that in 2010 we are going to produce one zettabyte of data which is an impressive amount. And just to show you the -- I'm not sure if this is a point or not. Okay. This is the forecast produced by IDC. And you say we have a problem with the available storage in the information create. And so this it seems is the trend. So it's very impressive trend. And so starting from this scenario, we must be able to face this challenge. So to handle this very large amount of information and the associated complexity, which is related to the processing of this large amount of data. So the idea here is to use, as I said before, large scale distributed system like HPC, cloud system, grid systems, P2P systems, to provide sufficient computational support for this kind of application and for analyzing this data. For discovering the interesting part of this data. We had some talk today and yesterday on this topic. Very, very interesting topic -- talks. So that seems are in accordance with this approach and then half of that even if has been done by well known and very expert people. So the idea here is to try to use this bunch of following to support the completion of integrated framework for doing analysis of data through services interface. Okay? So the basic idea is to try to identify single stack of this big complex process and implement each step. I will show you model later. As a single service. 
And then composing services in a sort of ecosystem of services that may run on this kind of large scale infrastructure. Having at the hand a sort of data analytics clouds, you know. We have a big data center and data center are used to store data but not only to store data, to process, then to query. And typically the interface of data is service oriented interface. So in this case, the idea is two hats, data analysis or data analytics services for handling with this large amount of data. And obviously in this kind of infrastructures we have security facility, resource information services to identify the resources we have in terms of data, in terms of software, in terms of algorithms and so on. Communication mechanisms, scheduling, fault detection. So all this is sort of basic infrastructure that could be used for implementing data analytics clouds. Now, okay. Yeah. When we speak about data analysis, we have to deal with data mining algorithms. So with the distributed and parallel implementation of data mining algorithms, that means adopting data parallel approach or task parallelism approach, managing data dependencies. Having the possibility to define workflow, so dynamic task graphs that obviously are based on data dependencies themselves. Deal with dynamic data access and all these aspects are part of the tasks or the problem of implementing parallel data mining and all distributed mining approach. You know, there are so -- actually we may say that parallel data mining algorithm could be part of distributed mining applications. So especially in a cloud. I may have complex data analytics application that's run parallel data mining algorithms some part of clouds in some virtual machine and these are part of a larger scenario, larger application which is distributed for example in a larger cloud or in some intercloud, okay? And obviously we need to program this data mining operating task and patterns. 
So this is one key point for me because my idea is to implement these data mining operations as services, okay? That could be integrated among them. Obviously you know we may address the problem at different levels and these different levels are not alternatives among them, okay, so obviously we need to use, okay, traditional approaches: libraries, languages, concurrent languages and so on as we know. But obviously this could be an alternative, or this different approach based on components could be part of the solution. It could be a way to integrate both approaches. And on top of that, we should think about Web services, grid services, cloud services, workflows, mashups and so on. Okay. If we go up, we increase the grain size of the application, the grain size of tasks. And going down we increase the number of processes. So the number of, how to say, the degree of concurrency we have: the grain of tasks decreases and the number of tasks increases. Okay. So, to do this we suggest to exploit the service oriented architecture to define services, basic services for supporting distributed mining applications, yeah, at large scale as I mentioned before, large scale distributed systems. For example, in private cloud -- in private clouds but also in larger clouds like interclouds. Okay. So the idea is having services for data selection, for data transport, for data analysis, for model -- knowledge model representation and also for visualization. More in detail, the idea is this one. So -- sorry. The idea is, okay, if you need to run a data mining application or knowledge discovery application, you need to identify the steps of your task, okay, of your application. Some of them obviously start from the -- from data, okay? You have to identify the data source on which to run the analysis. And then, for example, you can pre-process data, filter data and so on. So you need to prepare data for the mining. Okay?
So in general, in a KDD process, this is the first part of the process. And we think that each of these operations, so each of these single KDD steps, can be implemented as a service, okay? And you have services, for example, for pre-processing data, for filtering data, for transforming data. Having them in the integral format for the analysis, okay? Going ahead, each single data mining task can be implemented as a service. For example, we may have classification algorithms, clustering algorithms or the Apriori algorithm, association rule discovery and so on. Each one of these can be implemented as a service. Okay? You may have the code of this algorithm in Java or C# or whatever, but you may offer it as a single service, okay? So in each layer we have a collection of algorithms that are offered as services. Then we can, in a distributed setting, use single data mining tasks or single KDD steps to implement distributed mining patterns, okay? So I want to run, for example, a parallel classification or a meta-learning application or a collective learning application that runs on a large amount of -- a large number of machines, okay? I can take single services for analysis on a single machine and compose a distributed mining pattern, a distributed mining application. What is important is that also this one could be a single service which is composed of a set of services that come from this layer. On top of that, we can go on to have a complete data mining application or a complete KDD process implemented as a single service and implemented as a collection, a collection of previous tasks and patterns. Typically this could be done in a sort of multi-step workflow, okay? Where each node of the workflow can be a single service or a multiservice in terms -- so it can be composed of several elementary -- how to say -- small grain services. Okay?
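The layering described above, where single KDD steps are services and a composition of services is again a service, can be sketched with plain Python functions. This is a minimal sketch under assumptions: local callables stand in for remote services, and the names `compose`, `drop_missing`, `normalize` and `threshold_classify`, as well as the toy classification rule, are hypothetical and do not come from the systems presented in the talk.

```python
from functools import reduce


def compose(*services):
    """Chain services left to right; the result is itself a service."""
    def composite(data):
        return reduce(lambda acc, service: service(acc), services, data)
    return composite


def drop_missing(records):
    """Pre-processing step: remove missing values."""
    return [r for r in records if r is not None]


def normalize(records):
    """Transformation step: scale values by the largest one."""
    top = max(records)
    return [r / top for r in records]


def threshold_classify(records):
    """Toy mining step: label each value as high or low."""
    return ["high" if r >= 0.5 else "low" for r in records]


# Lower layer: single steps. Middle layer: a composite pre-processing service.
preprocess = compose(drop_missing, normalize)

# Top layer: the whole process, built from the composite plus a mining step.
kdd_process = compose(preprocess, threshold_classify)

labels = kdd_process([2, None, 8, 4])
```

Because `compose` returns something with the same shape as its inputs (a callable from data to data), a workflow node can be either an elementary service or a composite one without the caller needing to know which.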
So we have a way to compose data mining application in which we reuse all the services in the lower level for implementing application in the top layer. Okay? Okay. So at then we have a sort of open service framework for cloud-based data mining. So in this way having single services and having a way to compose the services in an application we allow developers to program distributed data analytics processor implication as a composition of single or aggregated services available over a cloud. Okay? So these services obviously should exploit other basic cloud services, for example, for data transfer, replica management, data integration and querying that perhaps are already available in many clouds, okay? And all this is a sort of ecosystem. Service oriented ecosystem for data -- for data mining. Okay. So at the end, this kind of approach may result in service based distributed mining application for communities with -- we discussed it before, the community of chemistry people or other scientific community that need to have a way to run a data analysis application or for virtual organization. So community of different physical organization. And we may have also distributed data analysis services on demand. Because people may access the cloud and ask to analyze some data. Okay? On demand. So at the end we have a sort of knowledge discovery ecosystem. Okay. So I hope I have 10 or 5 to 10 minutes to -- okay, just to show some examples of this kind of approach. Because I tried to outline the general approach so the general way of for implementing this kind of approach and then I will show you some examples. Just to finish this first part obviously we may wonder if data mining services are or are not programming abstractions. So in terms of I would say programming language approach. So what is this? Having in a additional way this is -- is not apparently a sort of programming abstraction proposal. 
But I think that we should consider data mining services as programming abstractions because if each single data mining algorithm or single data mining application could be used as a small element of a more complex application, we should have mechanisms to compose them, so to program them to run applications -- more and more complex applications. So we have basic services as simple operations. Okay? And using service programming languages for composing them, like workflow based application or workflow based paralleling, we can compose the basic services for programming data analytics applications. And obviously we have complex services and their complex composition towards the implementation of distributed programming patterns for data analytics services. Okay. So just to show you some examples of the approach I presented here, I give you a very quick overview of four systems. I have no time to go into details but we have a lot of material in case you are interested in any of the systems. And some of them are open source and are available on the Web. Okay. Weka4WS is one approach to implementing the Weka, the Weka toolkit, which is a well known open source toolkit for data mining. The only limit of Weka is that it runs on a single machine. So it's a sequential application. What we did is provide an implementation based on Web Services so we may run it on a larger scale. So typically on the Web, so on larger scale infrastructure. The Knowledge Grid is another data mining framework, a software framework for the implementation of the project proposal. And then we did some experiments in providing Mobile Data Mining Services. And this is very interesting because this is a way to couple mobile devices with the cloud. So you may use the clouds also by requesting the running of applications from a mobile device, okay. And then the last one is Mining@Home, which is an approach based on peer-to-peer -- peer-to-peer architecture.
Okay just to show you some detail. So this is just the interface of Weka4WS in which you have a way to compose very complex workflows of data analysis, so an entire knowledge discovery process that typically starts from a data set that could be divided, partitioned into several data sets. And all of this can be run in parallel on a cloud or on a Web server in which -- so each node can be run on a different machine. Okay? So the interface allows the user to program the application at a very high level and then the system will run the application in parallel and move data where data is needed and move results back to the user when the results are available. Okay? So in this way, we program a data mining workflow and run this workflow in parallel on a large scale distributed infrastructure. The same approach has been used in the service-oriented Knowledge Grid. So the idea is, as I mentioned before, we have a set of services. We have a way to compose the services and to run them on a distributed infrastructure. So what are the services here? The services are data access services, data mining algorithms offered as services, distributed mining patterns offered as services and so on. So you have a catalog of all the software and data you have available to compose your application. Then you look at the catalog of all the resource -- all the services available and compose your workflow. Okay? So the workflow composition is very abstract. I will show you a slide of the interface. And then the system takes care of running these in parallel on a distributed infrastructure. So the user doesn't worry about this transformation, about this passage, okay? And in this way the user focuses on the services available and on his application, not on the distributed details -- sorry, on the architecture details of the distributed infrastructure. Okay? Yeah.
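The workflow pattern described here (partition a data set, analyze the partitions in parallel, move the partial results back and combine them) can be sketched on a single machine. This is a minimal sketch under assumptions: a thread pool stands in for the distributed machines, item counting stands in for a real mining algorithm, and the names `split`, `mine_partition` and `run_workflow` and the default of four partitions are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Partition records into `parts` slices of roughly equal size."""
    return [records[i::parts] for i in range(parts)]


def mine_partition(partition):
    """Toy mining task: count how often each item occurs in one partition."""
    return Counter(partition)


def run_workflow(records, parts=4):
    """Split the data, mine each partition concurrently, merge the partial models."""
    partitions = split(records, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial_models = list(pool.map(mine_partition, partitions))
    # Counters add item by item, so summing them merges the partial counts.
    return sum(partial_models, Counter())


model = run_workflow(list("abracadabra"))
```

The caller of `run_workflow` only names the data and the number of partitions; where each partition actually runs is hidden behind the executor, which is the same separation between workflow composition and infrastructure that the talk attributes to the workflow interface.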
The idea is this -- this is just a UML -- extended UML approach in which you compose your workflow and then annotate each node with some information about the algorithm or the tools, et cetera. But perhaps it is better to show this interface, okay, because you compose your application as a workflow. For example, in this case we take a data set, we split it because we have a splitter, we split the data into different partitions and we run them on four machines, for example, okay? And then we collect the results and produce the model. Okay? So the user has this kind of interface, okay? So this interface corresponds to this and this layer. Then when the application is composed then all they take are green, so it means everything is available. You may say run and this is run on a distributed platform. Okay. I'm going to conclude the talk just showing what we did with a service -- the implementation of a service oriented mobile data mining framework. In this case we have the clouds, we have a set of machines and we have the clients that are mobile, okay? So here the idea is that we have data providers that typically are on the cloud, and mining services, and mobile cloud -- mobile clients that are outside. Okay. So in this case, the idea is that you can -- you can have access to the resources, select the data set you need to analyze, select the algorithm from your mobile phone and then say run. The application is run on the distributed infrastructure and the results are sent back at the end. All this is done through a service oriented infrastructure, through a service oriented interface, because here there is just a client for service invocation. Okay? So it's very simple. Okay. So what happens is that, okay, you can choose some parameters and then you see the result. Obviously here we have some problem with the representation of results because the screen is very small. But we provide an interface that lets the user select which part of the result he wants to visualize at which time. Okay.
The last one is to try to exploit this approach in a sort of very large-scale infrastructure using the public resource computing paradigm. So in this way, the idea is to emulate the approach of SETI@home or similar, having Mining@home. So in this case, obviously the idea is that, for example, for a community of scientists that share data that are public, typically if you think about biology data or -- okay. I'm going to finish. And they may cooperate, sharing data, sharing machines and obviously running applications on these machines. So this is a sort of centralized data analysis application that could be programmed as a large collection of tasks and services. Okay. This is just a snapshot. Okay. So, yeah, obviously we evaluated the impact, so I mean the overhead of services, and typically the overhead is not so much. So it's a -- okay. It's a very small percentage. So having the application -- offering the application through a service-oriented interface does not complicate, does not have a very large overhead. So this means that this is -- it's feasible. Okay. We did some speedup evaluation. Okay. This is just the last slide, just to conclude. So the idea is that, okay, I think we all agree that high performance computing infrastructures may allow us to attack new problems. But they're required to solve more challenging problems. And perhaps new programming models and new programming environments are required. So these programming frameworks should be devoted to managing data and analyzing data, because data is becoming a very big player. So programming data analysis applications and services is a must. Okay. In the long-term vision this approach may lead to a sort of pervasive collection of data analysis services and applications that must be accessed and used as public utilities. And I think that this approach is in accordance with the approach that the cloud community is pursuing. So it's running in this moment.
Obviously we must be ready to manage this very complex scenario. Okay. Thank you very much. [applause]. >>: Unfortunately we only have time for one question, but -- >> Domenico Talia: Okay. We can discuss later. >>: So if there's any questions. >>: I'd just like to give a comment, in the most humble sense. I think you should reconsider your use of the red color in the slides, because sometimes it attracts the attention to what's not the essential part, okay. That's cultural [inaudible]. >> Domenico Talia: Thank you. >>: I just wondered whether you have -- it would be possible, maybe you both can answer the question. >> Domenico Talia: You may start. [laughter]. >>: No, of using workflows to deploy the combination, composition of the mining. >>: I can only say that's a great idea [inaudible] abstracting away [inaudible]. >>: Exactly. >>: I love the hierarchical [inaudible]. >>: Yeah. >> Domenico Talia: Yeah. Because if I can add something. Because users of this kind of system often are not computer scientists. So people, scientists or business -- I don't know, that I go a support to businessman should focus on the application, not on the architecture of the task. And perhaps a workflow, service-oriented approach could help. >>: [inaudible] we talk to financial analysts about their pipeline, they're doing different computations, but if you look at the core level they're using the same approach with different parameters in different ways. So it's skill [inaudible] support 100 pipelines to workflow [inaudible] very nice. >> Domenico Talia: Okay. Thank you very much. [applause]