>> Yan Xu: So apparently [inaudible] informatics is very attractive. We've got a lot of questions from the last talk on bioinformatics. The next one is sort of related, it's on big data, and it is by Ian Foster.
>> Ian Foster: Okay, thank you, and thanks for the invitation to speak. I'm going to give some remarks on a topic that I think is relevant to some of your concerns. It speaks to this issue that was raised by Dan: how do we take methods that are applicable in one community and scale them up in terms of the number of people supported, and perhaps the number of communities to which they are provided? I'm going to argue that we need to be looking at methods that have proven effective in industry, so-called software-as-a-service methods, to achieve that goal. I did a bit of field research recently, and this is the lab of a typical physicist, at the University of Chicago at least; I think we would recognize this scene in many other parts of the sciences. This fellow has many millions of dollars of very sophisticated physics apparatus, but his data management and data analysis solutions are not as effective as we might like, and of course these terabyte drives are actually pretty good for the data volumes he has now, but they start to break down as the amount of data available to him increases in scale. We have this increasingly complex process as data volumes increase.
These are some of the steps involved in the discovery process; I'm sure that you could all add ones to that list. I've heard Alex Szalay observe that if the amount of data is increasing exponentially over time, then either we are processing exponentially more data per publication, or, the more pessimistic view, we are throwing away exponentially more data or not paying attention to it. So we are all in this situation captured by Lewis Carroll: you might expect that if you run very fast for a long time you will get somewhere; no, you had better expect to run even faster than that, because of the problems of scale that we are faced with. I don't like to say it, but I have to admit that one of the reasons I used to go to work was that I had this wonderful technology there that was far more sophisticated than anything I had at home. That's no longer the case. If I go to work I deal with very powerful computers but in many ways very primitive technology, relative to what I have in my daily life, where I have things like Netflix, which can stream probably petabytes of movies to me, Google Mail, which just kind of works without anybody having to operate it, and a system called TripIt that I make use of, and let me say a few words about that. Some of you are familiar with TripIt; if not, you should use it. It's a wonderful system. If you are going somewhere it will do some interesting things. This is me.
I'm coming to Seattle, let's say. I have to book a flight. It doesn't do that for me. I suppose it probably could, but then it observes in my e-mail inbox that I have a flight reservation. It records it and it goes off to suggest hotels. I book the hotel. It records it. It gets the weather.
Okay, I'm going to Boston, not to Seattle, so it's sunny. I don't believe that thing about it never raining in Seattle [laughter]. That's what they always say when I come here. It can prepare maps for me. It will share information with others, monitor prices, monitor flights, et cetera. What used to be a personally painful process has now been outsourced and automated in a way that is actually pretty powerful. So you might say, what's this got to do with science? Well, imagine that my incoming e-mail described a new astronomical data set, and just by virtue of getting that e-mail you could imagine a service that would register it somewhere, look up other similar data sets, perform some analysis, share the results with other people, et cetera, et cetera. We'll come back to what we might want to do with that sort of thing in a second, but first of all a few words about what's sitting behind the covers of this service.
Well, it's something called software as a service. There is an application that's operated by someone, in this case a little company called TripIt. They have a single code base, run somewhere, that delivers an actually very sophisticated service, and an increasingly sophisticated one, because it's constantly being upgraded as a service without any work being performed by any of its users, to a very large number of people. There are economies of scale, and this system is highly web-architected, so virtual observatory people should recognize some of the techniques they use; it's engaging with a lot of other services as it proceeds. What would it mean to take some of these functions and outsource them to a research-IT-as-a-service capability? We started a project at the University of Chicago a while ago to look at this question, and we initially focused on this little problem of moving data.
Now I know that some people argue that moving data isn't a problem because data should never move. In fact I think Jim Gray argued that, in a sense, so this gives me an opportunity to quote an astronomer who, when told that the Earth didn't move, said: well, yes, it does. Data is sort of the same. You might not like to think that it moves, but it does, because data often ends up in the wrong places, and in fact Alex Szalay has argued why that might be: data gets produced in lots of places, for varying reasons, and often needs to be taken to other places for storage, analysis, et cetera. We are focusing on this problem that data is in the wrong place, and realizing that for quite a few people, not everyone by any means, but quite a few people, moving data is a significant challenge. We put in place a service initially called Globus Online, now called Globus Transfer, which manages the movement of data from one place to another. If you look at it online there is a little ticker at globusonline.org saying how many megabytes we've moved; we are up to close to 7 petabytes at this point.
Under the covers it's a software-as-a-service application, so it sits somewhere, a single code base, and serves maybe 6,000 or so users at present. You interact with it via a web GUI, a command line, or a REST API, and it manages the movement of data, handling the many mundane but often time-consuming things that make data movement painful: credential management, authentication, managing the movement of maybe hundreds of thousands of files, optimization of network protocols, retrying on failures, et cetera, et cetera. Like a good software-as-a-service offering it achieves high uptime, because we replicate state and do other sorts of things. And because it has a very intuitive interface, lots of people are getting excited about it, but let me show you what it looks like briefly. This is the web GUI. This is actually a climate science archive that a European group has stood up on a Globus Online endpoint. Here I am specifying that I want to move a bunch of data. The task is started. Here I'm looking at what happens. Okay, it's already a tenth of the way through in the time between when I clicked the first screen and the second screen, and I can zoom in and look at things, and finally, actually two hours later, I have finished moving 440 gigabytes at 60 megabytes per second. Not a very high speed, actually; often we get far greater speeds than that, but it's nevertheless being performed automatically. There was one fault in this case. Not very many, but it's a frequent concern that one of your transfers has failed and you are not quite sure which one. It moved 620 files, et cetera, et cetera. So who finds this useful? First of all, facilities like the fact that they can hand off responsibility for data movement in and out of their facility to a third party, essentially, so many supercomputer centers, university facilities, and a growing number of experimental facilities like the Advanced Photon Source are recommending Globus Online to their users. But we are also finding interest from experiments, projects in which data movement is one small part of an end-to-end workflow. For example, we have been working closely with Don Petravick as part of the Dark Energy Survey, and you know more about the Dark Energy Survey than I do, but as one part of their end-to-end workflow they need to be able to move large numbers of files every night to Texas for analysis and move the results back to Illinois. When we first met Don he had a couple of people who were busy designing, and about to start implementing, a framework for moving data reliably back and forth between these two sites. He reckoned it would be quite a few person-months to implement that framework.
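To make the pattern concrete, here is a minimal sketch of what handing such a recurring transfer job off to a hosted service can look like. This is not the actual Globus API; the globus_client module, the endpoint names, and the method signatures are all invented stand-ins.

```python
# Hypothetical sketch of outsourcing a nightly site-to-site transfer to a
# hosted transfer service. The module, endpoint names, and method
# signatures are invented for illustration; this is not the real Globus API.
import time

import globus_client  # hypothetical SDK

service = globus_client.connect(user="des_ops")

task = service.submit_transfer(
    source="ncsa#des_archive",       # illustrative endpoint in Illinois
    destination="tacc#des_scratch",  # illustrative endpoint in Texas
    paths=["/raw/tonight/"],
    recursive=True,
    verify_checksums=True,  # integrity checking handled by the service
    retry_on_failure=True,  # transient faults retried automatically
)

# The service, not the user, babysits the transfer; we just poll.
while not task.is_complete():
    print(f"{task.files_done}/{task.files_total} files moved")
    time.sleep(60)
```

The point of the pattern is the last loop: failure handling, retries, and protocol tuning live in the hosted service rather than in project-specific scripts.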
They were able to simply hand off that task to Globus Transfer, therefore avoiding the need for any custom software solution. And we've also put in place for them, as we are doing for a growing number of sites, a custom interface which provides for login and credential management, presenting them with the credentials they need for the Dark Energy Survey resources that they are using, and so on and so forth. Another major site that has adopted this technology is Blue Waters, I guess the largest NSF supercomputer. They've decided to use this outsourced solution to manage all data movement in and out of their facility and between their facility and their archival storage systems, and again, we can provide them with a custom webpage that provides the appropriate credential management, et cetera. The basic concept, then, is that you have a complex activity, and data movement is certainly complex, that's common to many applications.
One wants to be able to take it out of the hands of individual users and deliver it once to a large number of people. This function is something that we have been writing code for for many years, and for a long time we would deliver it to people by providing essentially tarballs that people had to download, install, upgrade, and maintain themselves.
Since then we've learned a lot about why that's a bad way to deliver software functionality, and we could talk at length about why it is such a bad approach, but let's take it for granted for now and go on to look at how we deliver software differently in this context. Basically, we run the software on a set of computers that themselves access databases of identities, profiles, and groups, and other databases of transfer state. These databases are hosted on a commercial cloud provider that provides for replication across multiple availability zones and therefore high availability, and that's how we achieve our three nines availability. We started off trying to do this in a university setting, and actually a national lab setting, and even a well-run national lab will tend to shut down power one day a year, perhaps to upgrade some power supply. This plus other problems gives you a lot less than three nines, I think, in any noncommercial data center context. Another interesting thing about running a cloud-hosted service is that you get a lot of data about what's going on, which is sort of fun, I find it fun anyway.
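As a back-of-the-envelope check on those availability numbers (my arithmetic, not the speaker's):

```python
# "Three nines" (99.9%) allows under nine hours of downtime per year;
# a single scheduled one-day power shutdown already costs more than that.
hours_per_year = 365 * 24                   # 8760
allowed_downtime = 0.001 * hours_per_year   # ~8.76 hours per year
one_day_shutdown = 1 - 24 / hours_per_year  # ~0.9973 availability
print(allowed_downtime, one_day_shutdown)   # 8.76  0.99726...
```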
Here is a graph: each of these dots represents a data transfer. This axis is the number of bytes transferred, so anything up here is bigger than a terabyte, and this axis is the amount of time taken for the transfer. I have colored those involving the National Energy Research Scientific Computing Center in red. There are a few really big transfers, multi-hundred-terabyte transfers, up here; some of those involve LIGO, which has been using Globus Online to move a lot of data between their archives. There are some very long transfers; the longest one ran for a couple of months, between some mass storage systems which had a very slow data rate in and out. You can also see transfers that are greater than a gigabyte per second, a few of those, not very many; those that are more than 100 megabytes per second, a lot of those; more than 10 megabytes per second, and so on and so forth. There is a lot of information in here, which we are using to guide optimization both of Globus Online and of national and local cyberinfrastructure.
That's data movement. We might then ask, well, what other tasks can we seek to outsource in a similar manner? I think there are a couple of axes we can take as we pursue that goal. We can look at functions that are common to a very large number of scientific disciplines, and then we can look within individual disciplines and find functions that are common perhaps across many projects in those disciplines. We have been focusing mostly so far on the first of these. So here is sort of a schematic that we have found useful to describe the activities that are performed in many, certainly not all, scientific projects. The data is collected in some manner.
It is ingested. It is perhaps cleaned. It is annotated, validated, and at some point it is copied to remote archives and backups. It's published. Users access a community store and a registry, perhaps contribute further components, et cetera. Globus Transfer addresses some of the arrows in this picture. We believe there are other services that we should be providing to address other aspects of it. Some of the ones that we are working on at the moment: storage, providing mechanisms for managing storage and access to storage, something that people often seem to find complex; catalog, making it easy for people to create catalogs, particularly people involved in smaller projects, where the ability to keep track of what you are doing and the information you have collected is important; and then, because we often find that once one has created and shared some data one wants to be able to collaborate around it, a set of so-called domesticated collaboration tools, driven off the same identity and group management services as are provided for Globus Transfer and Globus Storage. Basically what we are doing here is expanding beyond moving to annotating, sharing, and publishing. There's more to do beyond this, of course, but it's a first step. So let me say just a few words about some of these other services that we are developing. Globus Storage addresses the following sequence of steps, which our experience shows occurs frequently in scientific projects and is not well addressed by existing tools. Some individual or project has some data. They want to place it somewhere in some convenient manner, and this could be a campus computing center, a lab computing resource, a commercial storage provider like Amazon or Azure, or a national research center. Once they've done that, they want to be able to access it anywhere via a range of protocols: HTTP for convenient access, Globus Transfer for high-speed transfer, perhaps some kind of desktop synchronization protocol. They want to be able to update it, version it, take snapshots. A common workflow that we see is that people put some data onto a storage system, they review it, they update it, and at some point they decide that it's ready to be shared with others. They want to take a read-only snapshot and then share that snapshot with perhaps a small community, or perhaps anyone with an internet connection.
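A schematic of that review-snapshot-share workflow might look as follows; the storage_client module and its methods are hypothetical stand-ins, not the actual Globus Storage interface.

```python
# Illustrative sketch of the put / update / snapshot / share workflow
# described above. The storage_client module and all method names are
# hypothetical, not the real Globus Storage API.
import storage_client  # hypothetical SDK

store = storage_client.mount("uchicago#research_storage", user="jdoe")

store.put("spectra/run1.dat", local_path="run1.dat")      # place the data
store.update("spectra/run1.dat", local_path="run1b.dat")  # revise after review

# Once the data is judged ready, freeze a read-only version...
snap = store.snapshot("spectra/", label="release-2012-10")

# ...and share it with a managed group, or with anyone at all.
snap.share(group="survey-collaborators")  # group from the identity service
snap.share(public=True)                   # anyone with an internet connection
```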
They may also want to be able to take a particular version of the data and synchronize it with a remote computing facility, for example to perform analysis if you don't have computing directly associated with the storage system on which your data is located. Each of these things is actually reasonably difficult to do using current technologies. For example, imagine that you've got some data on your campus resource and you want to share it with about half a dozen colleagues who don't have accounts on your campus system; that turns out to be a nontrivial thing to do in many cases. So Globus Storage addresses those capabilities. A few words about how we implement it--this slide isn't supposed to be annotated. As with Globus Transfer, there is a set of capabilities hosted on cloud resources: in this case the Globus Storage management services, all of the file system metadata, and again Globus Nexus credential, group, and identity management. By putting the file system metadata on a cloud resource and the actual data on either a conventional or a cloud storage system, we are able to provide these helpful storage management capabilities in many different settings: for example, on the University of Chicago research computing center, on a supercomputer storage provider, or on an Amazon or Google or Microsoft Azure cloud storage provider, and in each case we have this high-speed GridFTP access and HTTP access. Globus Collaborate is another--how are we doing for time, by the way?
>>: Time is okay.
>> Ian Foster: Wonderful. Another service. The interesting observation here is that once you've got credentials and groups managed for individuals or for a community, you can start to do a lot of things with those credentials and groups. Typically you want to be able to not only share some data with a particular group, but also maybe share documents with that group, track tasks, send e-mail to that group, and so on and so forth.
Globus Collaborate is an expanding set of collaboration tools that are being adapted to use the groups that we create, so things like Drupal, JIRA, and other capabilities. Again, under the covers these systems are running wherever they may run, but they are making use of this cloud-hosted information which we maintain for these research communities. I think I will skip through this; it's just a little too-busy animation showing what you can do with this. To start to wrap up, the overall theme here is that in data acquisition, management, and analysis, exploding amounts of experimental and computational data and text, this so-called big data problem, lead us to the need for what we might call big process: increasingly automated processes that are able to deal with increasing volume, velocity, variety, and variability in our data, and increasingly collaborative analysis of that data. My assertion is that we are not going to be able to deliver that big process if we depend on individual researchers, labs, and even institutions operating the appropriate services. Instead we need to find ways of leveraging the economies of scale that result from delivering research IT as a service, much as consumers and businesses take advantage of so many other capabilities that are delivered as services. As some of you may know, a small business nowadays doesn't tend to hire any IT staff; they tend to outsource essentially everything involved in running a small business to third-party providers.
A small lab doesn't tend to do that, but it should. I want to mention that while I think the work we've done on Globus Online is pioneering some new methods, we are certainly not the only ones pursuing these sorts of ideas. I will just mention three examples here. MG-RAST is an Argonne-based project that delivers metagenomics sequence analysis as a service. This is a rather old slide; they're up to I think 56,000 metagenomes that have been uploaded to their site for analysis, and what users tell me as I travel is that it avoids the individual user having to hire a postdoc and get that postdoc to spend six months learning and installing software, et cetera; instead they can simply upload their sequence and get the result returned to them immediately, and many people find that liberating. nanoHUB is a system that's been running for quite a few years and claims now to have hundreds of thousands of users of its nanoelectronics simulation software. And then finally, a local project that some of you may know about is SQLShare, from Bill Howe at the University of Washington, who is delivering basically database capabilities as a service. The idea here is that people should make more use of SQL databases than they do. One of the reasons that they don't is the labor involved in installing and running these servers, and also the high learning curve that some people perceive as being associated with them, so SQLShare seeks first of all to avoid the effort involved in learning how to install these systems by running them for you, and secondly to guide you through the process of developing a schema for a new database.
Okay, so, thank you very much. Lots of support from various people and here are some pointers to more information. Thank you. [applause]. Yes.
>>: So are the Globus services all provided for free, or is there a business model associated with this? How does that work?
>> Ian Foster: Yes and yes [laughter]. Yes, so that is a key question. We are very concerned about this question of sustainability. I believe that software as a service has the ability to address the scalability challenge, first of all by reducing costs per user and secondly by providing a framework by which you can charge and get some money back. So far we are not doing that. Our view is that we should be proceeding as follows, deliver as is commonly done in industry the basic services for free and then charge people for some degree of premium access, so for example, one thing we are talking to a number of sites about is premium services that will help them manage the data flow in and out of their individual resources, so you can move data in and out as much as you want using the Globus Transfer service that we provide today, but if you want to be able to have a greater degree of visibility and management on that traffic, then we can provide tools that will help with that. So we have this vision of a nonprofit organization that will find a way to sustain these services for the long-term. Yes?
>>: A quick question, are you planning to implement a VO server on this?
>> Ian Foster: You mean a virtual observatory server? Because for me VO often means virtual organization. Yeah. So that sounds very interesting. I don't know what it would involve, but that would be something that we could perhaps talk about.
>>: There is a VO protocol which does similar things for doing data transfer [inaudible] space…
>> Ian Foster: Yep, okay.
>>: So we, there is prior effort which we did with people at SDSC a couple of years back where
[inaudible] an earlier version of the Globus technology, so it's probably something that should be [inaudible].
>> Ian Foster: Yeah, that would be great.
>>: [inaudible] other such things that you're doing.
>> Yan Xu: Couple more quick questions here and then, I'm sorry.
>>: Yeah, can you just say a few more words about the registry thing, and in particular, if somebody starts a process and they look at what's going on and they decide it's not doing what they want, are they taken out of the registry? The registry is one of the countermeasures for publication bias, so I'm wondering if that's what you have in mind.
>> Ian Foster: I'm not quite sure of the question you're asking. So you are referring to this thing I called the Globus catalog, or the transfer management?
>>: Well you showed a register in one of your flowcharts and so I just wanted to know more about how that worked.
>> Ian Foster: Oh, I see what you mean. We observed a frequent requirement--this may take too long--so we are interested in providing a service, working with a fellow called Carl Kesselman at ISI, that would allow you to very quickly stand up a registry that may have a freeform or unconstrained schema, and then provide mechanisms that allow you to put data into that registry, et cetera, et cetera; in a sense like SimpleDB, but a more science-oriented implementation. Now, what policies govern the use of such registries is entirely up to the community, and one would imagine many such registries: some would have policies with a very strong community curation process and some might be totally freeform.
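In spirit, such a freeform registry might be used like the sketch below; the catalog_client module and its calls are hypothetical, chosen only to illustrate the schema-free, SimpleDB-like model.

```python
# Hypothetical sketch of a freeform-schema registry of the kind described
# above: entries are just attribute bags, and the community decides what,
# if anything, to require. Module and method names are invented.
import catalog_client  # hypothetical SDK

registry = catalog_client.create_registry("sky-surveys")  # stand one up quickly

# No fixed schema: each entry carries whatever attributes make sense.
registry.add_entry("sdss-stripe82-coadd", {
    "instrument": "SDSS 2.5m",
    "coverage": "Stripe 82",
    "format": "FITS",
    "contact": "jdoe@example.edu",
})

# Query on any attribute that entries happen to share.
hits = registry.query(format="FITS")
```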
>>: Yeah, and I think that's important.
>> Yan Xu: [inaudible] and then that gentleman there.
>>: So, yeah, quickly, I'm curious about in terms of Globus Storage and catalog, there are things like DuraCloud and Dataverse and how would you put this in context with those activities?
>> Ian Foster: Yeah, we have been talking to both of those groups and also the Dryad storage people, and so I am not yet sure of the relationship. We want to provide tools that let people manage any stores to which they may have access, in a way that provides a set of capabilities that people seem to find useful: versioning, snapshotting, access control. So you could imagine someone like Dataverse building on that capability; certainly if they had started doing what they are doing after we created Globus Storage they might have done that, but nevertheless they may still find it a useful capability. I think they are providing a vertical solution that addresses a particular use case, whereas we are trying to provide crosscutting capabilities.
>> Yan Xu: So last question.
>>: So quickly, you talked about how companies often now outsource their IT requirements, yet labs and scientists don't. I think one of the reasons for this is that somebody has to pay: it's [inaudible] to get a funding agency to pay for that, whereas they will happily pay for a postdoc to do some science, because it's training as well. So do you see this paradigm changing? How do we help the change?
>> Ian Foster: I think that's the key issue. The economics of academic research are murky and there are various perverse incentives. I gave a talk like this at NSF, and I think it was rather sad when a physics program manager said, well, if you do this, how will physics postdocs get jobs, because they won't learn how to do Linux system administration? [laughter]. I don't think he was joking. So the one question I could ask is, how many of you pay for extra storage on Dropbox or some equivalent system, SkyDrive [laughter], yes, I'm sorry. If the costs are low enough then I think labs can afford to pay for them, and that works because there are economies of scale. So whether there are similar economies of scale to be achieved in research is, I think, a key question.
>> Yan Xu: If you wish to talk to Ian further, please do it off-line because we are running out of time.
>> Ian Foster: Thank you.
>> Yan Xu: Thank you again. So if some of you still have questions, perhaps your question will get answered in the next part: [inaudible] big data astroinformatics in the cloud, that's what Ravi is going to be talking about.
>> Ravi Pandya: I'll just click here, oh. Well, you've seen this graph once already today; I think it's kind of a requirement that any talk about bioinformatics includes this graph. It's actually pretty dramatic if you look at it. Moore's law runs at about an order of magnitude every seven years; if you look at this graph since 2007, it's three orders of magnitude in four years.
That's a huge difference. What that has done is drive a huge increase in the number of genome sequences. This is a summary of roughly the total number of human whole genome sequences done worldwide: in 2010 there were 2,700; in 2011 there were about 30,000. I don't know how many there were in 2012, but it's going to be in the hundreds of thousands, and that doesn't even count a lot of other genomes that are significant and interesting. There's metagenomics, which looks at bacterial genomes, plant genomes, other creatures; there's probably another order of magnitude beyond this in different kinds of genomes being sequenced, and each of those is a significant amount of data. If we look at the medical applications of genomics, and in particular what I'll focus on here is human whole genome sequencing, at the moment we are in a phase where there is some genomic research going on, and essentially most of those 30,000 genomes were done for research purposes of one kind or another. There is the 1000 Genomes Project, there is a similar one beginning in the UK, the UK10K project, and a variety of others focused on specific diseases. The next phase of this is translational medicine, which is larger-scale analyses, not quite clinical studies, but really looking at particular disease populations and so on, and those often involve, you know, thousands of patients for, let's say, diabetes or Parkinson's disease, those kinds of things. Then as this starts to move into the clinic, the numbers get substantially larger. Within the U.S. there are on the order of a couple of hundred thousand cases a year of children who have unusual symptoms where they try to figure out what it is, and genome sequencing can be a real benefit in those situations. In fact, right now at UCLA it's possible for a physician to simply order a genome test, a gene panel, in these cases, and there have been a number of very significant successes. At the moment that uses exome sequencing, which is about 1% of the genome; apparently soon that's going to be whole genome sequencing. The next big jump is going to be cancer pathology.
There are about 5 million cases a year, and I'm going to go into some detail about that later. Then finally there's the end goal, where you have personalized medicine that is really tailored to your personal genome: you can expect that at birth your genome will be sequenced and that it will be used to guide your medical treatment, lifestyle choices, a variety of things over the course of your life. If you look at the amount of data involved here, as this starts to move from research into translational and clinical use, the numbers are substantial. Even assuming some fairly large degree of compression, the cancer genomes alone will run on the order of 15 petabytes a year, which is a fairly significant amount of data, and this is also data that needs to be handled carefully, because it's personal health information. We need to handle it reliably; it needs to be replicated for backup and for access, and you need very good security and privacy controls around it. There are not a lot of institutions that are going to have the capability of managing that amount of data, at that scale, with those kinds of security and reliability concerns, so we expect that over time this is really going to move into some sort of managed cloud storage infrastructure like Windows Azure. Let's take a look at cancer in particular. This is all-cause mortality in the United States over the past 40 years, and without looking at the legend there are basically two things that you can see: there is red, which is cardiovascular disease, and yellow, which is cancer. Over the past 40 years cardiovascular disease has gone down by over half because of a variety of treatments that have been very effective, and in fact, if you look at the trend, in the next year or two it's going to be less than cancer. Cancer, unfortunately, has stayed stubbornly pretty much constant over the last 40 years. But there is hope. Cancer is fundamentally a genomic disease: every time a cell replicates, roughly 1 in 10^8 bases will have an error when it's copied. If you multiply that by the 6 billion bases within a human genome and by how often cells replicate, on the order of a trillion mutations will happen in your body in the course of a single day. Most of those are either benign or the cell simply isn't viable afterwards, but if you get mutations in the wrong place, and a number of them combine in the wrong way, that will produce cancer. In particular, there are a set of molecular pathways, the sorts of things that David Reese was talking about earlier today, involved in cell growth and replication and in controlling cell death, that are involved in cancer, so the mutations that cause cancer are concentrated in a number of specific molecular mechanisms.
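That trillion-a-day figure is easy to sanity-check. The error rate and genome size are from the talk; the number of cell divisions per day is my own order-of-magnitude assumption.

```python
# Rough sanity check of the "trillion mutations a day" figure.
error_rate = 1e-8     # replication errors per base copied (from the talk)
bases_per_cell = 6e9  # diploid human genome (from the talk)
mutations_per_division = bases_per_cell * error_rate  # ~60 per cell division

divisions_per_day = 2e10  # assumed order of magnitude for a human body
print(mutations_per_division * divisions_per_day)  # ~1.2e12: about a trillion
```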
There are a number of new drugs that target those mechanisms very specifically; there are several you might have heard of: Herceptin, for example, for breast cancer, Gleevec, and so on, and when they work they can be very effective. If you have melanoma with a very specific mutation called BRAF V600E, there is a drug called Zelboraf, a kinase inhibitor that changes the network activity in a particular part of the cell's metabolism, and here, after 15 weeks, there is a very dramatic change. Now unfortunately, in this particular case eight weeks later there was a relapse. Cancer is not simple; it's a very complicated disease, and if you look closely you'll notice that many of those recurrences are in the exact same places they were before. The drug was very effective, it got rid of almost all of the cancer cells, but not enough, so what we need is to get a better idea of the complex nature of cancer, and to do that we need more data. There is a project called The Cancer Genome Atlas which is collecting data about a variety of cancers, and at the moment it has about 300 terabytes of data gathered from 5,000 different cases of cancer, of about 20 different kinds. At the moment what it has is a set of samples of both tumor and normal genomes, but it only samples the exome, the roughly 1% of the whole genome that is expressed most frequently; it isn't even all of the genes, but it is a substantial amount. This is very soon going to change to whole genome sequencing. The ENCODE project that David Reese mentioned this morning has found that somewhere between
20 and 80% of the genome has functional consequences, so exome sequencing is really not enough to figure out what is going on. TCGA is expected to grow to about 5 petabytes of data and 25,000 cases over the next couple of years, which is still a tiny fraction of the actual cancers that occur in the United States alone, but it will have tumor and normal whole genome sequences as well as RNA expression data for those tumors, and this is a database that is managed by the University of California at Santa Cruz and hosted at the San Diego Supercomputer Center. Bill Bolosky of Microsoft and I are working with some folks at UC Berkeley, UCSF, and UC Santa Cruz, and we are going to start looking at this data and mining it using large-scale machine learning, to try to understand some of the interesting patterns that occur in cancers. So let me go into a little bit of detail about what the genome sequencing process looks like. You start off with a sample genome, which has 3 billion base pairs, each T, C, G, or A. Since you have paired chromosomes, you actually have 6 billion base pairs that you're going to try to figure out. The modern [inaudible] sequencing process then takes that and replicates it.
If you want to figure out a single genome, typically you'll replicate it about 30 times; it's called 30X coverage. Then it's chopped up into sequences about 100 base pairs long, and a photochemical process, basically attaching fluorescent dye molecules to each base in turn, is used to sequence those hundred-base-pair fragments. In the end, what comes out of the sequencing machine is about a billion reads, each of which is about 100 base pairs, and from that what you want to do is reconstruct the original sample. It's a similar problem to having 30 copies of a newspaper, shredding them into very small pieces, and then trying to reconstruct the newspaper. There are a bunch of problems that arise because the data is very messy. The genome is not purely random; for example, there is a lot of repetition and duplication, and there are a lot of errors introduced by the sequencing process. What's typically done, the first step, is to take a reference genome. There are actually only a couple of genomes that have been sequenced from scratch: the Human Genome Project's, and HuRef, which is actually Craig Venter's genome. Typically they use the Human Genome Project's as a reference, and we take all of these reads and try to align each to the particular place where it might best fit on that reference genome. That doesn't actually reduce the data, but it gives you some useful information: which gene that read maps to, which chromosome, what particular area. Then, having done that, you do a process called variant calling: many of these reads will overlap, since you have replicated the genome about 30 times, and using a statistical process you figure out the differences, the variants, of that sample from the reference. That's a much smaller data file: you take your original 300 gigabytes of data and you end up with on the order of 100 megabytes, so it seems like it would be very easy, if you want to compress the data, to just keep the variant calling information. Unfortunately, that's not so easy. If you take a look at the different algorithms that are used for variant calling, this is a comparison of the three major variant calling algorithms that are out there, and they don't agree that well, and this is for a very simple task, just looking at single base pair changes, which you would think would be fairly easy to figure out. If you take the exact same data files and feed them into three different pipelines, you get three different sets of mutations. The pairwise agreement is roughly 85%; if you look across all three of them, the agreement is about 70%, which means that you really do, at this point, need to keep the raw data around. So I'm going to back up a little and talk about some of the work that we're doing here on the alignment process and how that can help with data compression and data understanding.
SNAP actually started as an interesting testament to something we are doing here as well, which is taking a look at projects and presentations outside your primary field of expertise. It started when Bill Bolosky, who is down at Microsoft Research, was attending a presentation at the UC Berkeley AMP Lab by David Haussler of UC Santa Cruz about cancer genomics. At that point the standard way of doing alignment was something called the Burrows-Wheeler transform, and Bill looked at this and had an idea: given the increasing length of reads and the lower error rates we were seeing, a very different approach might work well. What SNAP does is take the entire genome and build an index. It takes seeds, typically about 20 base pairs long, and records where and how many times each seed occurs throughout the entire reference genome, building up a fairly large hash table; for the human reference genome this is about 40 gigabytes, and it's not unreasonable these days to have a machine with 64 gigabytes of RAM to run this. So the combination of the longer read lengths, the lower error rates, and the availability of high-memory machines makes this possible. Then, when you want to figure out where a read best fits on the genome, you basically just take seeds out of that read, look up in the index where they occur, and then compare the entire read against each computed location in the genome to find where it best fits. This turns out to be substantially faster than the state of the art: a 30X coverage run takes about 15 core hours.
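To show the shape of the idea, here is a toy seed-and-extend aligner. It is only a skeleton: the real SNAP uses ~20-base seeds, handles sequencing errors and quality scores, and carries a ~40 gigabyte index, none of which appears here.

```python
# Toy illustration of SNAP's approach: index every fixed-length seed in the
# reference, then place a read by looking up its seeds and scoring the full
# read at each candidate location. A sketch only, not the real SNAP.
from collections import defaultdict

SEED_LEN = 4  # SNAP uses ~20-base seeds; 4 keeps the toy readable

def build_index(reference):
    """Map each seed to every position where it occurs in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - SEED_LEN + 1):
        index[reference[pos:pos + SEED_LEN]].append(pos)
    return index

def align(read, reference, index):
    """Return (position, mismatches) of the best-scoring placement."""
    best = None
    for offset in range(len(read) - SEED_LEN + 1):
        seed = read[offset:offset + SEED_LEN]
        for hit in index.get(seed, []):
            start = hit - offset  # where the whole read would sit
            if start < 0 or start + len(read) > len(reference):
                continue
            mismatches = sum(a != b for a, b in
                             zip(read, reference[start:start + len(read)]))
            if best is None or mismatches < best[1]:
                best = (start, mismatches)
    return best

reference = "ACGTACGTTAGCCGATTACA"
index = build_index(reference)
print(align("GTTAGCCGAT", reference, index))  # (6, 0): exact match at position 6
```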
That is basically as fast as it can read data off the disk in order to process it, and it's 10 to 40 times faster than the earlier generation of BWT-based aligners. So what used to take a day takes about an hour. I'm going to skip over this in the interest of time and talk about reference-based compression. Once you have a fast aligner, you can deal with some of the problems of taking this data and moving it up to the cloud. It generally wouldn't be feasible to move 300 gigabytes at a time up to the cloud, but if you can do a fast alignment, you can align your data with the reference genome and then use that purely for compression purposes: you send just the differences against the reference, which gets you, you know, 90 to 99% compression, and now you have 3 to 30 gigabytes of data, which is quite reasonable to move up to the cloud. Where we would like to get to with SNAP is that you have your instrument, you generate that 300 gigabytes of data from the sequencer, you do a fast alignment, and you now have a set of compressed reads which you can reasonably upload into a cloud service. Then in the cloud you have the general platform capabilities that we have: the reliability and scalability, the HIPAA-level privacy and security compliance that's important when dealing with health information, and also the general machine learning, Cloud Numerics, and Data Explorer capabilities for building analysis pipelines on that data. What we are in the process of doing is looking at what sets of genomic services we might want to provide on Azure for managing that data at scale, in order to have metadata about each of these sequences, both the patient genomes and the reference genomes, and to be able to query across them. Because once you have hundreds or thousands of genomes, you really want to be able to look at specific slices across many different patients. You would like to be able to look at correlations between the genomic data and the phenotypic data: what disease states do they have, whatever health information you might have about that person, and any other sorts of temporal data you might want to look at. In addition to building software like SNAP ourselves, we are also looking at working with other people in the community, such as Galaxy, which is a genomics pipeline built on top of the Globus toolkit you just heard about, and GATK, a standard genomics analysis pipeline built by the Broad Institute of MIT and Harvard, so that we can build an ecosystem where the data is managed securely: you have a secure and easy-to-use platform in the cloud for managing it, there are common domain-specific services, and then some general analysis and computational services that we can provide on top of that.
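As a footnote to the reference-based compression described a moment ago, here is a toy version of the idea: once a read is aligned, you store only its position and its differences from the reference. The record format is invented for illustration.

```python
# Toy sketch of reference-based compression: an aligned read is stored as
# (position, length, edits) rather than as its full sequence. Format invented.
def compress_read(read, reference, position):
    """Encode an aligned read as its position plus its edits."""
    edits = [(i, base) for i, base in enumerate(read)
             if reference[position + i] != base]
    return (position, len(read), edits)

def decompress_read(record, reference):
    """Rebuild the original read from the reference and the edits."""
    position, length, edits = record
    bases = list(reference[position:position + length])
    for i, base in edits:
        bases[i] = base
    return "".join(bases)

reference = "ACGTACGTTAGCCGATTACA"
read = "GTTAGACGAA"  # aligns at position 6 with two differences
record = compress_read(read, reference, 6)
print(record)  # (6, 10, [(5, 'A'), (9, 'A')])
assert decompress_read(record, reference) == read
```

A perfectly matching read compresses to just its position and length, which is where the 90 to 99% figure comes from.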
>> Yan Xu: Thank you.
>> Ravi Pandya: Thank you. [applause].
>> Yan Xu: He can take a couple of quick questions.
>> Ravi Pandya: Yes.
>>: So when you said that 30 gig was a reasonable amount of data [inaudible], that suggests that's still the main barrier; the amount of data has to be somewhere in the tens of gigabytes or…
>> Ravi Pandya: That seems like a reasonable threshold per patient, especially when you start talking about thousands, hundreds of thousands, millions of patients.
>>: So right now what is the total amount of data you have in the cloud?
>> Ravi Pandya: So at the moment we are still in the process of just getting the basic reference data. We don't have any yet; this is looking at some of the stuff we'll be doing going forward. At the moment most of this data is kept in different research institutions' private clusters.
>> Yan Xu: Thanks again, Ravi.
>> Ravi Pandya: Yeah, thank you. [applause].
>> Yan Xu: So next we are going to move to a panel discussion. I think we have a good number of people in this room, so we can be really interactive. We have panelists Dan Fay, Jeff Dozier, and Dr. Vera. We have panels scheduled for pretty much each of the half-day sessions, and we have suggested topics for each panel, so I would appreciate it if we stay with these topics so that we can have as much coverage as possible, and now…
>>: [inaudible].
>> Yan Xu: So this one is suggested as focusing on the [inaudible] exchange and lessons learned from collaborating across disciplines. Perhaps I can have each of the panelists, starting from Dan maybe, say a few words.
>> Dan Fay: Is that the end? [laughter] I guess I sat at the wrong end [laughter]. The biggest thing that we've seen from working on some of these interdisciplinary projects is that the first hurdle is actually the vocabulary, the language, the discussion points of talking between the different disciplines. A lot of times it takes up to six months to actually get onto the same language, because the terminology just has different assumptions built in. There are also assumptions built into data sets and information that, unless you're in those fields, you may get wrong: how they were collected, maybe the precision on them, and so forth. So that's one of the things that we look at a lot. It's not one of those situations where you can just jump in, or very easily sit back and say, hey, I'm going to pull all of the data together very quickly, because you still have to have knowledge of the information and know what assumptions are built into the data and the language of that domain.
>> Yan Xu: Jeff?
>> Jeff Dozier: So, I have an idea about how to do this in a way that's different than what we usually do and I've talked to Yan about this for the open data, open science group.
>> Yan Xu: Uh-huh.
>> Jeff Dozier: The way that we often do these interdisciplinary collaborations works pretty well but it's cumbersome, and that is that we get, you know, a small group of researchers together, hopefully a small group, and we write a proposal to some interdisciplinary opportunity, and we get the money and we spend a few years working together, and that's often very fruitful. We don't want to throw that away, but it's also somewhat cumbersome in that you have to form these partnerships, and it may well be that the partnerships you want to form are with people that you don't know. So what I have thought of is this: I have a fairly complex problem that I work on, and there are lots of steps in it, and, using the phrase that Jim Gray and Alex Szalay use about always going from a working solution to a better working solution, it's a problem that I have an end solution for, but I haven't really solved it. In other words, there are all sorts of steps along the way where it can be done better. So my idea is, is there a mechanism, or let's find a mechanism, where I can publish what I know about the problem and make data sets available, with enough context to the problem that people from the community can jump in and say, I know something about this particular part. It's a little bit inspired by the chess match of Kasparov versus the world, in which people could jump in, look at it, and make suggestions about what the next move ought to be, and in some cases, if you've read that story, there were a couple of really insightful moves that came out of the community and really surprised Kasparov, and he said it was one of the most exhilarating matches of his life, even though eventually he did beat the world [laughter]. So there are examples of this in pattern recognition of images, in coming up with better algorithms, in having better communication mechanisms, but I really want to figure out a way that I can collaborate with people that I don't actually work with, and perhaps whom I don't know, and I think this might be a mechanism, and so I am planning to convince the people here at Microsoft that they are going to do this.
>> Yan Xu: We'd like to use this as one of the mechanisms to get us started. Doctor Vera?
>> Eduardo Vera: Indeed, I think what we are seeing is a dramatic change in the operational model, the working professional model, of how we collaborate, and it's bringing many new avenues and probably very radically changing the way we work, giving us new degrees of freedom in how to approach problems, so I agree with what's been said. I just want to focus a little bit on what that means in terms of challenges for the institutions, and specifically for academia. Just to share the experience of the University of Chile: we started the Center for Mathematical Modeling, which is very broad, and it has attracted people from different departments and created a kind of interdisciplinary platform, but also a neutral territory that has allowed people from different departments to collaborate better. Still, there are many, many challenges. You want to create a new course, you have all of these possibilities, but you crash against the walls of very established traditions, and just yesterday Yan and I were talking about some of these traditions going back many centuries, so it's not easy to change them.
>> Yan Xu: Right.
>> Eduardo Vera: And we did manage to create an astroinformatics course, and that meant putting together two departments, computer science and astronomy. We did not manage to include applied math, although our center is really a spinoff of an applied math department. There you start finding all of these barriers that should be trivial but seem to be very, very strong, so we need to put the right incentives in place. I think the incentives are not aligned with what's happening: we are reshuffling the way that we can interact and exchange knowledge, but our careers and our publication records are evaluated in a very traditional way, and sometimes my biggest challenge is to encourage young people to look sideways and move forward, and that's not trivial. So although it is really exciting to see how with all these software tools you can really move data and look at it in a completely different way, I think we still have a long way to go in terms of adjusting our institutions to those changes.
>> Yan Xu: Absolutely. Questions to the panels? Yes?
>>: All really good points, but let me pick up on Jeff's. I think essentially what you are suggesting is a form of professional crowdsourcing, or expanding the concept of open data and open code to opening the research process.
>> Jeff Dozier: Yeah, I think that's a good characterization.
>>: And I think this would be a wonderful thing, but reflecting on what Eduardo just said, I think academia needs to learn how to give credit to and reward in a nontraditional fashion.
People don't share the research process now because somebody is going to steal their idea or their data and publish the paper and they get the credit, but surely we ought to be able to figure out some sensible manner in which credit can be allocated ranging from a single sentence in some blog post for some brilliant idea, all the way to solving the whole damn thing.
So if we have credible ways in which people can participate, even if they open up the research process, I think that would be a right step to take, but I don't know how to do it.
>> Jeff Dozier: Well, I have an idea but Alex has got his hand up and…
>> Alex: So what Jeff says I agree with very much, and I am about to [inaudible] the Google site where I heard the talk about a project that Terence Tao started. He's one of the rock stars that put [inaudible] and I think it's called [inaudible]. [inaudible] the world's leading mathematicians created an open source [inaudible] where they collaborate together in order to solve problems and bring together an unusual mix of expertise and [inaudible] people [inaudible].
I think that this is a wonderful idea and [inaudible].
>> Jeff Dozier: Okay, I think the mechanism that I would use is that there are some old guys around who don't care if their ideas get stolen [laughter], and so what I'm trying to do with Yan is to look at the Jim Gray 20 questions model and say, can we get 20 people in different domains to do this and then see what happens. It means that for Microsoft it's not a long-term investment that goes on forever: if it works, that would be great and we would keep it going, and if it doesn't work, you know, we go away. But I think by recruiting a small group of people who are known very well in their disciplines, whose CVs are long enough that adding another paper or two doesn't make that much of a difference, we can do this, and then if it works, it may well inspire people at the more junior level who are worried about the length of
their CVs to participate, showing that there are benefits to sharing that outweigh the disadvantages.
That's my optimistic view.
>> Yan Xu: So I am going to have a sign-up sheet and ask around for people to sign up for this.
>> Jeff Dozier: Okay and then we have…
>>: [inaudible] proposing is [inaudible] research date [inaudible]. I have a problem [inaudible] research [inaudible] there is a community of professional [inaudible] and then also [inaudible] by the fact that it might quality index increase in this index is also beginning to be asked
[inaudible] more or less. So I think some move [inaudible] solution also [inaudible] maybe
[inaudible]. It's already working [inaudible].
>> Yan Xu: Joe is next, yeah.
>> Joe: I think as usual it's the sociology that's most difficult here. People are not yet even used to the open data idea, never mind the open idea idea. In astronomy there is so much data and people are still clutching at it; for example, SkyServer [inaudible] open from the beginning. Anybody can access it. People don't use it because they are suspicious: why are they giving it away for free? And so there is that psychology that, well, if it's free it surely can't be very valuable.
>>: Yes, unless you put ads with it, then you would have [laughter].
>> Yan Xu: So perhaps a student or a postdoc here could share some thoughts about this? [inaudible], you were going to say something.
>>: Yes, yes, I think it's interesting that we probably have to create new indexes, or ways of measuring impact. Citation is the traditional one, but there have to be others, and when you see that there are software tools that allow you to monitor very well how information flows, I mean, just as when you analyze the web you realize what it is that everyone is interested in, somehow we could monitor how much information flows to people, and perhaps that could be a good way of creating new measurements of the impact of this.
>> Yan Xu: Yes, that would be one plus side of this.
>>: I think that this psychology and sociology aspect is really important. It's even reflected in the publishing model that astronomers follow. I've had some of what I thought were great ideas that I couldn't solve; if you could find a solution, it would have an impact, but I have no idea how. I can't publish them in today's market, because when I submit to Monthly Notices or ApJ or something like that, I have to have science, I have to have results, and I have to be able to demonstrate them. I can't just publish an idea. Here's an idea; I get a report back saying, well, if it's such a good idea, go on and show me how useful it is by applying it to something. And I say, I can't solve this particular problem [inaudible]. So the way we publish things also needs to change, which might help to shift the psychology of the way we see value in the process, from starting an idea, through the technology, through the results; there may be different teams contributing different parts of that process, and they might be totally separate. Changing the publishing model, I think, will go a long way to helping the average astronomer adapt to this new way of doing things.
>>: That's a good point, so David Hilbert would probably not be able to publish his famous presidential essay with 23 problems today [laughter].
>> Yan Xu: That is one of the main topics that we wanted to get discussed.
>>: Our discussion on a new paradigm for publishing, yeah. Can I…
>>: Sure.
>>: I want to steer in another direction. I think, as Dan was describing, for many years we've been doing this kind of thing where there is a pairwise interaction between an astronomy or other science group and some computer scientists, and yes, we spend six months or more learning each other's lingo, and eventually something is done. Then the same process repeats again, all of these pairwise interactions. I thought the whole point of having something called E science was to get to something more scalable, so that we don't have to keep reinventing the wheel and relearning the language and all of that. So I think we need to think about what kind of mechanism can foster that. I would say applied CS is the new mathematics, the new universal language, and if we can phrase our problems in that way, just as you can write a set of equations so that everybody can understand what they mean, maybe that's a way in which we can share methodological advances and ideas across the different fields. But right now we have this cottage industry that just doesn't scale.
>>: I agree, and I was discussing this with [inaudible] earlier and in the past, and with [inaudible], who I think will be speaking tomorrow. Are there ways to categorize some of the algorithms that get created, even on our side, in machine learning and some of these other areas? Are there ways not only to describe them, I'll say, in more layman's terms so that others could read it and say, well, maybe this could work, but also to see another possible application of those algorithms beyond the specific domain they were created for? If you read the paper, it will be very detailed on why that algorithm was created, but unless you have that knowledge sitting there, and you are having that pairwise interaction, there's no understanding of how it could be reapplied in some way. So are there ways to come up with a language or a mathematical model that you could actually interpret?
>>: Yes. There is actually quite a bit of work going on in how to classify using semantics, semantic web mining.
>>: Right. And then there's the corollary problem of how I describe my problem in that way as well, so that you can actually narrow it down to some of those methods.
>>: I think Alex.
>> Yan Xu: Alex?
>> Alex: I'd like to bring up another issue. There was a paper that came out almost 10 years ago by a man named Sean Eddy. I would like to encourage everyone to Google his name and read it. It's a two-page paper. It's about what happens before a new scientific discipline emerges, and how interdisciplinary cooperation, not in the optimal way [inaudible], transitions [inaudible] before that transition happens. It's a wonderful [inaudible] and very controversial article [inaudible], and I see many people in this room with [inaudible].
>>: What is the name? Could you repeat that?
>> Alex: Sean Eddy.
>>: The paper is "Antedisciplinary Science"; it sounds like anti, but it's A-N-T-E.
>>: [inaudible].
>>: And it's only two pages, and if you don't like it, it's only two pages. [laughter].
>>: Another way of fostering this sort of many-to-many collaboration is the citation of data sets in papers. In some ways the biologists have done this better, like the Protein Data Bank that Phil Bourne operates, where you can look up and see which proteins have been cited in papers. I might have more collaborative interest if I knew which other people are using the same data sets that I'm using. For example, some of the publishers now send you e-mail when somebody cites a paper of yours, and I find that going and reading that paper is illuminating; it leads me into new directions on what people are thinking about. If I also had similar information about what people were citing on the same data sets that I was trying to analyze, often from a different perspective, that would be very useful information.
>> Yan Xu: Is that sort of the flavor of ADS Labs, the new interface? Yeah, perhaps you can comment on that?
>>: Datasets in astronomy are normally cited through the papers that describe them, but the linkage is there. There are privacy issues, so you can't see who is reading something, but you can see that it is being read, and you can see what is being read by people who read other things, which tells you what those people are interested in. Those kinds of things taken together allow you to do pretty much what you are suggesting.
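A minimal sketch of the co-readership idea just described: given anonymized logs of which items each reader touched, count which items are most often read alongside a given one. The log format and item names here are hypothetical; ADS does not expose its data in this form.

```python
from collections import Counter

# Hypothetical anonymized logs: one set of item identifiers per reader.
reader_logs = [
    {"paper-1", "paper-2", "dataset-X"},
    {"paper-1", "dataset-X"},
    {"paper-2", "paper-3"},
]

def also_read(item, logs, top=5):
    """Items most often read by the same readers as `item`."""
    related = Counter()
    for items in logs:
        if item in items:
            related.update(items - {item})
    return related.most_common(top)

print(also_read("paper-1", reader_logs))
# [('dataset-X', 2), ('paper-2', 1)]
```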
>>: I think the earth and environmental sciences are backward in that context. We talk about the problem a lot, but usually the citation goes in the acknowledgments and is therefore not easily searchable.
>>: Right, that makes it very difficult.
>>: Data mining, text mining will solve that.
>> Yan Xu: Right.
>>: Well, but…
>>: [inaudible] inside ADS. If you want to text mine it, do it.
>>: Yeah, often the citation is too vague: "we thank the producers of the Global Land Data Assimilation System for their wonderful work" [laughter], something like that.
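A minimal sketch of the kind of text mining being proposed here: scan acknowledgment text for known dataset names and aliases. The alias registry and the sample sentence are illustrative assumptions; a real service would need a curated registry and fuzzier matching than exact phrases.

```python
import re

# Illustrative registry of dataset identifiers and their aliases.
DATASET_ALIASES = {
    "GLDAS": ["Global Land Data Assimilation System", "GLDAS"],
    "SDSS": ["Sloan Digital Sky Survey", "SDSS"],
}

def find_dataset_mentions(text):
    """Return dataset IDs whose name or alias appears in the text."""
    found = set()
    for dataset_id, aliases in DATASET_ALIASES.items():
        if any(re.search(re.escape(a), text, re.IGNORECASE) for a in aliases):
            found.add(dataset_id)
    return sorted(found)

ack = ("We thank the producers of the Global Land Data Assimilation "
       "System for their wonderful work.")
print(find_dataset_mentions(ack))  # ['GLDAS']
```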
>>: Well, one analogy that I have: having all of these wonderful data and archives and not being able to share the tools to explore them and discover stuff is like going to a telescope where everybody has to build and bring their own spectrograph, which is crazy. So I'm sure there is a lot of wasted effort, time, and talent in reinventing things all over again, even within a single discipline like astronomy, never mind between different sciences, and I thought that's exactly what E science ought to solve. I can't think of anything more clever than what we are already doing in just trying to reach out, but certainly having some kind of open collaborative platform for the exchange of ideas, with some kind of guarantee that people can get credit if they suggest something that's really good, maybe that's something we should try.
>> Yan Xu: I agree with what you suggest. E science is really built as the platform for this kind of collaboration, and then there is the E science conference, where we find each other, run by IEEE and by Microsoft Research for the past several years. How many people here go to the E science conference regularly? So, very few. [laughter] Come and join us in Chicago this year, then next year in Beijing, if you didn't make it to the IAU, right? It really is a great platform for interdisciplinary exchange.
>>: I think the reason is that that's a solution looking for a problem.
>> Yan Xu: Yes.
>>: We think it's a great idea, so we go to find collaborators, but we don't know why we need them.
>> Yan Xu: You are right, only the solutions are presented, versus the ideas.
>>: Yeah.
>>: Well, part of it comes down to focusing the talks on the problems and challenges that people are having within the discipline, rather than on how great I was or how great our solution was. That's the kind of discussion we don't see a lot of at other conferences: I'm trying to analyze this data this way, I'm using maybe this algorithm with this tool; hey, have you looked at this? Hey, have you tried this? Again, not giving you the solution, but giving you a nudge in maybe a different direction that you never would have looked at. So yeah, I do agree with you that it could be a solution looking for problems, but it's also one of those things where, how else do you find and re-create the hallway conversation that you might have within your own domain or across larger cross-domain communities?
>> Yan Xu: George?
>> George: In some sense the science informatics, like astro or bio or geo, are the interim step. I mean, they are talking to a particular discipline, and people understand what's going on, and then in some sense E science is a method informatics that eventually will help them exchange. So I was hoping we could get something from bio informatics or whatever, and we learned that they are hoping to get something out of [inaudible] informatics, which [inaudible].
>> Yan Xu: Yeah, personally I find that beyond what I learn from the agenda of the E science workshops, it is the off-line [inaudible] interactions, the social kinds of things during the events, where I learn a lot from different disciplines. The next one.
>>: Even before cross-disciplinary collaboration for new research, which is needed, of course, there is the communication of existing knowledge, which is poorly communicated to the disciplinary field. Over here we have algorithms and methodology [inaudible] and science [inaudible]; over there we have the astronomer, who is a planetary or stellar or galactic or cosmologist astronomer, and unfortunately, as an editor of a major astronomical journal, I can personally say that the median expertise of an astronomer, a very good astronomer (it's an asymmetrical distribution), in methodology is close to what I would personally call pathetic. They don't even know what a good undergraduate knows who has taken an undergraduate degree, let alone a graduate degree, in any of those methodological fields. So I give that [inaudible], and so in addition to forefront, pioneering cross-disciplinary research [inaudible], needed and funded by the two gentlemen here from NSF and other people like that, we also need improved [inaudible] of standard but advanced methodology to the average disciplinary scientist, the astronomer, and I suspect the biologist and other disciplinary [inaudible].
>>: Yeah, you're right.
>>: [inaudible] powerful suggestions.
>>: No, there are hard problems that have been solved by someone else and finding those solutions is often difficult, yeah.
>> Yan Xu: Let's go there and then you. Go ahead.
>>: It seems to me that the people here are all together on this; we are kind of special people. We all agree on most of these ideas. But if you go to the average astronomer and tell him about open data and working in these interdisciplinary groups and things like that, they usually say, yeah, okay, but I don't really care about that. So what's the way of convincing those scientists that they need to work in an interdisciplinary environment?
>>: Results. That's the only thing that people pay attention to.
>>: Funding.
>>: Well, yeah. Resources and then results.
>> Yan Xu: And then, do you have a question?
>>: More a comment.
>> Yan Xu: Please.
>>: Another advantage of this meta-layer that may be represented by E science conferences is that you discover that there are some areas where [inaudible] bio informatics or [inaudible] are actually exemplary, and you discover that you can teach a little about what we're doing, and that's reassuring. You also see that you have to focus on some strengths; for example, the [inaudible] of astronomical data is something that is not obvious if you are embedded as an astronomer, but you see that it is a [inaudible] strength that you can build on [inaudible].
>> Yan Xu: Yeah, I will go back to your point that everybody who is here is already self-selected, like a group of people who drive Priuses: they have already chosen to believe in hybrids. So we here already agree; the question is how we can promote what we have agreed on and increase the community effort.
>>: And Eric made a really good point: the reason people don't use these tools is that they don't know they even exist. We barely teach students statistics, never mind proper data mining or other things, and the only reason we start teaching them is that we are in research and we need young people to know how to use them. So that leads directly to the research connected with computational education and the transformation of curricula that we'll talk about on the other side.
>> Yan Xu: So I wanted to accumulate the momentum, and then by Thursday we are going to burst it out and really have it [laughter].
>>: Maybe we can solve it all.
>> Yan Xu: We will need some help with that. Yeah, [inaudible] you were going to say something?
>>: I think that's important, because we see across the board, even in the technology spaces, that the amount and speed of the advances in many of these areas is so great that it's almost impossible to keep track of them all, not just in your own domain but also across the new methodologies and algorithms that are created, let alone what you are doing in your own field and then dealing across it. So it really is a challenge, and so how do you... we all have to get bigger brains, I guess.
>> Yan Xu: Uh-huh and you may have a suggestion?
>>: Or all have ADHD.
>>: [inaudible] but the point is that it's not only results, because the results come out of the [inaudible]. [inaudible] usually they are justified by the results, so this must be a paradigm shift. It's not something you get slowly, because we are aware of this problem with the VO. I mean, we are doing an enormous job [inaudible] observatory. The story is that all this work is being almost ignored, which is incredible. This is an enormous amount of work; it's changing the way astronomy approaches [inaudible]. So there is a problem related to it which needs to be approached from both sides at the same time, because we need to teach a new generation, and this new generation needs to be more motivated by their supervisors to learn a different type of routine in a different way. Otherwise they will just become clones [inaudible]; like in my case, all of my students are my clones, basically. [laughter].
>>: Let's hope not.
>>: And they have to modify [inaudible] whoever in this room [laughter] [inaudible]. You need to modify the way that you teach, and you must find ways; results alone are not the solution, because if you say, oh, that's enough papers, we can just convince the [inaudible] to pay different attention to this type of [inaudible]. And another thing: the results are the way. I think there may be specific [inaudible] that we are trying to make [inaudible] rather than make the [inaudible] work. So basically there needs to be a [inaudible] from the small group of [inaudible] that this field is promising: let's begin to open up the challenges in this field, let's make specific goals, let's stimulate the community [inaudible]; otherwise [inaudible] laziness. There is no way [inaudible]: I am old enough, why should I learn a new trick? I am an old dog.
>> Yan Xu: Yes, I agree.
>>: So I think I am looking forward mostly to the discussion, because all of these things, as to how they must be put together, must converge and find a way out, because we have been talking about them for the last five years, or if you want to start from [inaudible], we can go back further; for the last three years [inaudible], knowing in many cases who is [inaudible] in other communities. Sorry, I want to say this and then I will shut up. I was discussing before with Ben: here we are the elite of the people that do these things, and [inaudible] proposal in these [inaudible] community conferences; it's a stupid way to build a community. It's a pleasure to meet friends, it's a pleasure to discuss, but it's not the way you enlarge it. How many students [inaudible] conferences in China?
>> Yan Xu: Unless it's in Beijing.
>>: On the same point, I mean, [inaudible] works. We have [inaudible] possible [inaudible] and we are here still stuck with videoconferences.
>> Yan Xu: So that goes to the online publishing kind of sharing ideas, again.
>>: Yeah, I think that is the only viable part of this [inaudible] of course [inaudible].
>>: I think even with a virtual platform we should still have conferences, as a time that you reserve for thinking about the subject and interacting with people on that subject, and not doing everything else that you're supposed to do every day.
>>: Right.
>> Yan Xu: Let's take more comments. There?
>>: I'd like to say that I think astroinformatics is both familiar, as an effort to create a field, and different at the same time. Astronomers have already a few times created cross-disciplinary subfields by taking knowledge from inside another field. The one that comes to mind is astrochemistry, which [inaudible] the 1970s raised. A few people [inaudible] in the United States and others elsewhere really created this subfield, so that by roughly 2000 it really existed; everyone understood it. It had a little bit of infrastructure, certain databases (the University of Manchester [inaudible] molecular reaction rates), but essentially it's self-propagating. You just use the Manchester data and you do a science study, you publish it in the [inaudible] journal, and now everyone understands molecular astrophysics. They didn't have to go out and convince a thousand people; they only convinced ten people, and then later on a hundred young people, teaching them how to do this, to learn some physical chemistry. Astroinformatics is partly like that: a small group of astronomers in this room and elsewhere learn advanced methodologies, and computer scientists work with them to apply them to different astronomical problems and publish in a journal with the important results. The publishing has not really begun yet; it hadn't begun in astrochemistry in the 1980s either, it took some time, so in that sense it's similar. But I think in another sense it's different, insofar as 10,000 astronomers don't want to learn chemical reactions from master chemists, whereas a good fraction of those 10,000 astronomers may really, really benefit from astroinformatics. So we have both a research subfield that crosses fields, along the lines of astrochemistry, which I think is familiar, and a more novel effort of promulgation of methodology to non-experts who will take value from it, and that's the part that I feel is new ground.
>> Yan Xu: Yeah, please?
>>: Both of these points are about dissemination: one of the places we disseminate things is conferences, and the other is journals. But part of the problem with disseminating these things is where you publish them. The [inaudible] astronomy journals are too boring, and the theoretical [inaudible] journals and the applied [inaudible] journals [inaudible]. The [inaudible] journal just launched [inaudible] astronomy, which tends to be focused entirely on things that fall between both of these possibilities, and so one of the definitions of our subdiscipline came from that [inaudible] journal, and both boxes are now [inaudible].
>>: Just quickly, I want to say again that a [inaudible] journal may be, I mean, [inaudible] useful. For sure it's useful for people who want to have [inaudible] so-called [inaudible] and so on. For me there are [inaudible] much more modern and much more flexible and much more usable ways than just [inaudible]. Astro-ph maybe works [inaudible] but [inaudible] in the academic structure there is no way to get a position [inaudible] if you don't publish a paper; without a [inaudible] paper you can't [inaudible]. So [inaudible] doesn't change anything as far as the [inaudible] is concerned, because of the [inaudible], which is the place where we all [inaudible].
>>: We could have a one-hour discussion about this.
>>: [inaudible]. The single thing is, again, conferences. It's conferences, and not simply waiting for the solution. [inaudible]. We see each other every time; it's always the same people. I mean, [inaudible] get together [laughter]. [multiple speakers] [inaudible].
>>: I remember when you were [inaudible]. You were, you were, I mean, it's [inaudible].
>>: It's basically… one of the most beautiful things is that you meet old friends [inaudible].
>>: I mean, we have a community of [inaudible] 2000 people; I don't know how many of them are technicians from the [inaudible]. We are not capable of reaching them, [inaudible] we are doing something wrong, because [inaudible] what we are doing here is the future of astronomy. There will be no astrophysics if we don't find a solution to this problem. And I am upset because 95% of my colleagues do not realize that.
>> Yan Xu: I don't mind growing old with you, but I'd like to see more new faces. Matthew is next.
>> Matthew: Both of you just used words that are probably in line with what I'm going to say. It's wonderful that we see so many mid-career and senior-career people here, but the problem is that if we are to define something for the future, you need to engage people at the young level. How many people here are under 35?
>>: A few of them.
[multiple speakers] [inaudible] .
>>: In spirit [laughter].
>>: [inaudible] astronomy there is no junior level [inaudible], and we don't have those here. We don't have the next generation. So it's wonderful that we are architecting this, and I agree that this is what we need, and it does need high-level senior people promoting it at this stage, but we need to get engagement at a much younger level. So the question is, how do we do that?
>> Yan Xu: Right. I'm sorry. That gentleman first.
>>: I was going to come back, after Eric's point, with more astronomical history. It's my understanding that molecular clouds weren't recognized until the '70s, so the entire idea of astrochemistry really couldn't originate much earlier. I wonder if there is any deeper parallel there. You could have tried earlier and probably would have been laughed out of the pages of [inaudible], because there was no way you could ever have multi-atom molecules occurring in interstellar space; everybody knew that was crazy until the '70s. Whereas today we have, when you say [inaudible], and I wonder if the concern is that we are in the '70s [inaudible] [laughter].
>>: And let it go from the '70s to the 2000s in only five years.
>> Yan Xu: Just a few more comments from here and then we should go to lunch. You go first.
>>: I do think that conferences, these kinds of meetings, are crucial, because first of all the face-to-face meeting cannot be replaced by any other form of contact, for many reasons. Now we are running out of time, so I can't really get too far into this. It's true, though, that you're not going to disseminate here; you have to disseminate at the wider conferences, for example conferences like [inaudible], other more generic kinds of meetings where you can have these kinds of social interactions, and at some point it will [inaudible] somehow, because it's been shown that cultural trends follow the same dynamics as Darwinian selection: if it is the fittest, it will survive and it will conquer its niche. And of course the other thing is that I completely agree that at conferences you don't find a lot of young students. That's why we should have schools. They could be schools for astroinformatics, but they could also be more generic astronomical schools or astrostatistics schools, and there we can show both parts of the deal: you can do a lot of things using the standard techniques, and you can do a lot of other things using less-known techniques, and that's where you make young people understand the difference between the techniques, the potential of the different techniques. There you can have some success, because I wouldn't really worry about convincing people. If somebody thinks that these kinds of techniques don't suit him, he is free to think so. The thing is that if these techniques are really superior, I should be able to publish much more than he does, I should be much more successful than he is, and in the end the fittest will survive and prosper.
>> Yan Xu: It may take longer though. [inaudible] get back to [inaudible].
>>: I just want to say that maybe the number of people under 55 inside this room for this conference is not the right metric [inaudible] the kinds of steps for the astrostatistical measures of astroinformatics or whatever you are doing [inaudible]. I am an example: I have been cooperating very closely with a list of, I would say, [inaudible] who never heard of astroinformatics or astrostatistics, so nothing there. Their highest level of accomplishment in terms of informatics techniques applied to astronomy was [inaudible], which was a wonderful example of what someone who doesn't really think about astroinformatics will do with informatics tools. We started collaborating, and now I can see that they are [inaudible] of this kind of enterprise, because I showed them that it can accomplish something much bigger, much quicker, than what they used to do. So what I understand is that, while we should try to widen our reach beyond the people who are strictly coupled with astroinformatics as it is now, we also have to try to find a more effective measure of the impact that we are having. I may be lacking examples, but I can tell you what's happening in my, you know, network, my very close network, in terms of professional [inaudible].
>> Yan Xu: Absolutely, so perhaps we can close by having each one of you make a short comment, if you wish.
>>: I'll actually just build off of that. It's actually an interesting idea, and it kind of goes from the known to the new. If people already know of something and have some familiarity with it, can you just keep extending them along that path, by building upon the knowledge of something else that they've already been using? Good idea.
>>: I'm finding this conference to be fairly useful.
>> Yan Xu: Wonderful [laughter].
>>: Because, you know, by and large I only know a few of the people here, and I find I benefit from, I enjoy, going to meetings where I don't know most of the people, and so hanging out with astronomers is, for me, interesting and fun.
>>: [inaudible].
>> Yan Xu: Yeah, and then to [inaudible] using astronomy, you are welcome to join us at bio informatics and [inaudible] informatics conferences.
>>: Absolutely. I think that despite what we all say, we must look at new things but also use some of the conventional wisdom, in the sense that human networks are essential. We need to motivate people. We all have too many demands, and what makes the difference is being motivated, and we are usually motivated by people, so these meetings are important. And responding to the point that perhaps we don't have too many young people here: my students will feel an impact from the fact that I came here, because I will be able to transmit some instruction to the lower-level layer, and that's very important. Somehow, if we kill all of what's established, we won't get anywhere. Actually, we need to articulate the new ways and the old ways in a good way. It's a little bit like architecture: the best cities are the ones that combine very modern buildings with respect for tradition, and they do it in an effective way. I think that should be our inspiration.
>> Yan Xu: And thanks again to the panelists. [applause]. So lunch is all set up, and then we will come back here at one o'clock to catch up on time. Thanks so much.
>>: Oh you are welcome.