>> Ray Norris: -- completion in a couple of years. It consists of 36 12-meter
antennas, which is actually not that big as far as radio telescopes on the
whole are concerned. The really innovative thing about it is this 192-pixel
phased array feed. This is a Schmidt telescope, if you like, in radio
astronomy. This makes an enormous difference to the survey speed, and this is
a survey telescope. It gives it a 30 square degree field of view. So it's
even better than the Schmidt.
You can regard this as a step change: this is brand new technology, and no
other telescope has yet been equipped with this. It's high risk technology,
but it looks like it's working okay. This is a bit like optical astronomers'
migration from single channel photometers to CCDs. And we've never looked
back. And I think the same is going to be true of radio astronomy. In 20
years' time, my guess is every radio telescope is going to have a phased array
feed or [indiscernible].
So the antennas are built in China, [indiscernible] delivered on the site, and
we're extremely pleased with them. The surface RMS is twice what we -- sorry,
half what we specified, twice as good as what we specified, and they've really
worked out very well indeed. We're very happy with them.
That shows them a month or so back being assembled in the desert. They're
assembled in the factory in China, taken to pieces, shipped out to Australia,
and reassembled in the desert. Those of you who know radio telescopes know
that the next step is that you then adjust the panels using holography. We
started to do that. We found that no adjustment was required. And, in fact,
on the 36 antennas, we haven't had to adjust a single panel. They're all just
as they were set in the factory. Very impressive. Good bit of engineering.
So the control building there houses the correlator, which, as I mentioned
yesterday, absorbs ten megawatts of power out where there's no grid, so the
state government, fortunately, is building a small power station to run that.
It's about the same as a small town.
These are the phased array feeds that we keep on going on about, and you can
see there, that's the front of it. That's the working surface. And you can
see the sort of checkerboard. And, if you like, every two corners between the
black squares is full of [indiscernible]. They're very closely coupled. And
then we have 182 receivers behind that.
>>: Every receiver is going to be correlated with every receiver and every
telescope --
>> Ray Norris: No. It's a bit more -- so what actually happens, they get beam
formed into 30 beams, and then each beam is then correlated with every beam in
every telescope.
Okay. So that is a few weeks ago. All 36 antennas completed. Unfortunately,
we haven't managed yet to take a photo including all 36 antennas. Someone's
got to go up in a 747 or something to do that.
A big milestone a couple of weeks ago when we got the closure phase. That's
when you correlate the signals from three antennas using the phased array
feeds and basically hope that thing's zero, which it is. And three days
later, we got our first image. I hasten to say this is with only three
antennas. It's not the world's greatest image, but it shows us all the data
paths are working. I actually thought it was incredibly impressive three days
after getting closure phase. I can say that because I'm not part of the
engineering team. I'm very impressed by what the engineers have done.
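For concreteness: closure phase is the sum of the visibility phases around a
triangle of antennas, and antenna-based gain errors cancel around the loop, so
a point source should give zero, which is what the test hopes for. A minimal
numerical sketch in plain NumPy (not ASKAP's actual pipeline):

```python
import numpy as np

# Closure phase for antennas (1,2,3): phase(V12) + phase(V23) + phase(V31).
# Per-antenna gain phases cancel around the triangle, so for a point source
# the closure phase should come out as zero.
rng = np.random.default_rng(42)

# True point-source visibilities have zero phase; corrupt each antenna with
# an arbitrary unknown gain phase, as real electronics do.
gain = np.exp(1j * rng.uniform(0, 2 * np.pi, size=3))  # per-antenna gains
v12 = gain[0] * np.conj(gain[1])                       # corrupted V_12
v23 = gain[1] * np.conj(gain[2])                       # corrupted V_23
v31 = gain[2] * np.conj(gain[0])                       # corrupted V_31

closure = np.angle(v12 * v23 * v31)                    # radians
print(f"closure phase = {closure:.2e} rad")            # ~0 for a point source
```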
Okay. That's all I'm going to say about ASKAP. ASKAP science: ASKAP is a
survey telescope, and so there was an open request for proposals. There were
38 proposals and ten were selected. So EMU is what I'm going to mainly talk
about. WALLABY is a corresponding HI survey, and then there are eight others,
all very important. Many people here were involved in the VAST experiment,
which is transients, and probably some people involved in some of the others.
I'm going to talk about EMU up there.
So EMU stands for Evolutionary Map of the Universe. It's an all sky continuum
survey. Those are all the details. Probably the four key points: many of you
will know the NVSS, which is the largest all sky radio survey at the moment.
So you can compare this to NVSS. And it goes 45 times deeper than NVSS. This
isn't, of course, a criticism of NVSS. I always say NVSS has been a fantastic
workhorse for almost 20 years, and technology moves on.
As a result, we will have a database of 70 million galaxies, which we'll
detect. All processing has to be done in real time. I'll show you the data
flow in a second. And our plan is to put all the data in the public domain.
So the idea is, once we've finished all the commissioning and debugging, there
will be a -- you observe a part of [indiscernible] of ours. You
[indiscernible] until the processing is finished. We have a data quality
control step and hopefully, 24 hours later, you'll find the sources. That's
our goal. So there's no proprietary period.
And unusually for radio survey projects, we're hoping, planning to do cross-IDs
and redshifts as well as part of the survey. I'm hoping somebody else can do
it.
So to show you where we fit in the grand scheme, this shows all the current
big radio surveys. So we've got sensitivity going along here, so left is
good. And that's the area of sky, and up is good. So up here is good. Down
there is bad. And so you've got NVSS up there, which I mentioned, which is
the biggest sky survey, which we all know and love, which we all use every
day, which is the largest but not very deep.
The deepest one until very recently was the [indiscernible] observation, which
goes very deep with the VLA, but it's just a single pointing. And on astro-ph
is a new paper by Jim Condon, which is the same field with the [indiscernible]
VLA going a factor of a few deeper. So it's now currently the deepest radio
survey.
And you'll notice that all of these surveys are bounded by this line here.
ATLAS I'll mention. That's the survey we've been doing on the Australia
Telescope, which I'll mention in a minute. That's important because it's
similar in some ways to EMU. So basically, that line corresponds to a few
months' observing time on any modern radio telescope. You can't get much to
the left of that with existing radio telescopes, and that's where EMU sits.
So it really is out there in the white space.
And for those of you who like to think about discoveries per square centimeter
of parameter space on a diagram, you can count up the number of discoveries
there, and you see why I get excited about EMU.
To go from NVSS to EMU, okay, so a factor of 45. So if you want to do -- a
referee on a paper recently said this isn't a fundamental limit, because you
just need more observing time on the VLA. Yes, you do, it's 600 years with the
VLA to do EMU. So it really is a fundamental limit, actually.
>>: [inaudible].
>> Ray Norris: Yeah. Come on.
>>: Take your longevity pills.
>>: Fundamentally because you have the multiple beams.
>> Ray Norris: That's right. Basically, these phased array feeds give a
factor of 30 increase in survey speed.
>>: Do you find the galaxy first in continuum and then find the redshift on
the [indiscernible] line, or vice versa?
>> Ray Norris: Can I talk about that in a minute? Or, no, I'll answer it now.
So yes, we're going to produce complete [indiscernible] sources. We then do
cross-IDs against optical and [indiscernible] sources. WALLABY, the HI
survey, will produce redshifts for about one percent of our sources, the
nearest one percent, basically, and we rely on optical for the remainder.
Right. So here's the data challenge. I mean, these are [indiscernible] big
numbers, but we [indiscernible] so many big numbers, probably nobody is going
to be at all impressed. So we get nine terabytes per second out of the
antennas. We have this ten megawatt correlator out there, which reduces it
down to ten gigabits per second. We have a supercomputer in Perth where we
basically do the heavy duty data reduction.
So the net result is our UV data comes to 70 petabytes a year. But we can't
actually afford to store that. And so we're going to store four petabytes a
year: all the images, but we throw away all the UV dat -- all the spectral
line UV data. We'd like to keep it, but we can't afford it.
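For scale, here is the back-of-envelope conversion from those sustained rates
to yearly volumes, assuming round-the-clock observing; the 70 petabyte figure
quoted in the talk is the same order of magnitude as this toy estimate:

```python
# Back-of-envelope check on the data-flow numbers above, assuming
# round-the-clock observing (~3.15e7 seconds per year).
SECONDS_PER_YEAR = 3.15e7

antenna_rate_tb_s = 9            # terabytes/s out of the antennas
uv_rate_gb_s = 10 / 8            # 10 gigabits/s after the correlator

raw_pb_yr = antenna_rate_tb_s * SECONDS_PER_YEAR / 1e3   # TB -> PB
uv_pb_yr = uv_rate_gb_s * SECONDS_PER_YEAR / 1e6         # GB -> PB

print(f"raw antenna stream : ~{raw_pb_yr:,.0f} PB/yr")   # ~283,500 PB/yr
print(f"post-correlator UV : ~{uv_pb_yr:.0f} PB/yr")     # ~39 PB/yr
```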
>>: [inaudible].
>> Ray Norris: It's a few million for four. So it's a bit significant.
Actually, do you know the number? Okay. So for EMU, we just want to store
the images and the catalogs. So our images are 100 terabytes, which is pretty
reasonable. And when we extract the catalogs, we have about 30 gigabytes of
tables, which again is quite reasonable. And by the time we add the optical
data, my guess is 50. That's a bit low. Anyway, it's a manageable amount.
We know what the sky looks like at that depth, because we actually have
observed small areas of the sky. This is actually the ATLAS survey, which I
mentioned before. And it's important because it's to the same sensitivity and
resolution as the EMU survey. And so we're actually using this as a training
set for many of the data things I'm about to talk about.
So if you look deep down in there, every object you see there is a galaxy.
About half of them are star forming galaxies, about half are AGN. So it's
different from NVSS, where virtually everything is an AGN. Once you get down
to this depth, the sky really changes. Most things you're looking at are
actually [indiscernible] star forming galaxies. But here, we have a nice
head-tail, a relic and a cluster and so on.
Okay. The science goals. Well, basically, the science goal of EMU is to look
at the formation and evolution of galaxies. That's the big thing that's
driving it. To go into a bit more detail, we can break that down into a
couple of science goals. But it turns out we can also do some really good
cosmology with it, which is something we didn't realize at the beginning.
Clusters, we're also going to do the galactic plane. Basically, it's a
by-product. And, of course, legacy value. But what I'm really going to focus
on in a minute is this explore an uncharted region of observational parameter
space, because that's actually quite an interesting problem, how you do that.
Technical challenges, obviously lots of things with image processing, dynamic
range and so forth which I'm not going to talk very much about. I'm going to
talk very briefly about these three here.
Okay. So the source extraction. The way that EMU works: the EMU team is 230
people all over the world contributing varying amounts of time, as you can
imagine, and the real work is done by the working groups.
So we have a source extraction working group run by Andrew Hopkins. And the
interesting thing, they did a face-off between all the existing source
extractors, things like SExtractor, Sfind, the things you know and love in
[indiscernible] and so on. We found that actually none of them work that
well. All of them miss sources, they introduce spurious sources,
particularly if you're going to run them in an automated way. None are
actually going to do the job.
So we've been developing source extraction tools, and one of the nice things
about the survey, something that I didn't expect, is that we keep on
generating journal papers, even before the survey starts, which is not what I
expected at all.
But when you look at the processes you need to do to make a survey like this
work, the tools aren't out there and the processes aren't out there, so you
end up doing new stuff, which is nice.
We have a cross identification team led by Loretta Dunne in New Zealand.
We're going to automate the cross-IDs. This won't be a simple nearest
neighbor thing; she's exploring algorithms at the moment and we'll probably
end up with a Bayesian thing. [indiscernible] is involved in the Bayesian bit
of that.
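As a baseline for comparison, this is what the simple nearest-neighbor
matching looks like, sketched here with astropy; the coordinates and the 3
arcsecond radius are illustrative placeholders, not EMU values:

```python
from astropy import units as u
from astropy.coordinates import SkyCoord

# The simple nearest-neighbour baseline the working group is going beyond:
# match each radio source to its closest optical/IR counterpart and accept
# matches inside an assumed association radius.
radio = SkyCoord(ra=[150.1000, 150.4200] * u.deg,
                 dec=[2.2100, 2.3500] * u.deg)
optical = SkyCoord(ra=[150.0995, 150.7000, 150.4203] * u.deg,
                   dec=[2.2102, 2.5000, 2.3499] * u.deg)

idx, sep2d, _ = radio.match_to_catalog_sky(optical)  # nearest neighbour only
accepted = sep2d < 3 * u.arcsec                      # assumed match radius

for i, (j, s, ok) in enumerate(zip(idx, sep2d.arcsec, accepted)):
    print(f"radio {i} -> optical {j}: {s:.2f} arcsec, accepted={ok}")
```

A Bayesian scheme replaces the hard radius cut with a likelihood ratio built
from positional errors and source densities, which is the direction described
above.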
And so with the available surveys, that's WISE, Sloan, SkyMapper, VHS,
mainly, we expect to be able to cross-ID about 80 percent of objects. The
remaining 20 percent of the images are just too faint, and these are mainly
high redshift AGNs, which you just can't see in the optical [indiscernible].
And there's [indiscernible] which are interesting but complicated. And
they're hard.
And so for those ones, we're working with the Galaxy Zoo people, who I don't
need to introduce in this audience, and we're developing with them a thing
called Radio Zoo. So this is the interesting thing. We've got here, there's
a bog-standard spiral there, and other galaxies are interacting with it. To
the eye, it's pretty bloody obvious what's going on there, and you've also got
some [indiscernible] regions and things in there. But for an algorithm, it's
really hard. So we're developing this thing called Radio Zoo, and we hope to
have a beta version out in a couple of months.
Redshifts, we've got a working group run by Nick Seymour. He [indiscernible]
did the first redshifts for COSMOS. And I think Mara was very graceful about
it, but I think she was pretty upset when she found that her nice template
models, which had years of experience built into them, actually performed
worse than a kNN, which was knocked up by some bright student, Peter Zinn.
So anyway, they're doing a [indiscernible] challenge there, looking at
different classes of objects to see which algorithms work best.
The chances are we will end up, as everybody else does, using an ensemble of
these.
Okay. Paradigm shift. For redshifts, you don't always actually want to know
the redshift of every galaxy. I'm talking redshifts here, but the same goes
for a lot of things in a big survey. What you actually want is to do some
test. Let's say in cosmology, you want to go through a group of galaxies and
you want to know the redshift distribution of those galaxies. This is
actually quite different from asking what the redshift of an individual
galaxy is.
So if I take all of our radio sources and I ignore the optical and infrared
data -- we're also measuring polarization -- I know that if I just pick out
that subset of [indiscernible] which are polarized in the radio, they'll have
a median redshift of 1.9.
If I take the subset which is unpolarized in the radio, they'll have a median
redshift of 1.1. So I can immediately divide our sample into two sub-groups.
So if I want to look at, let's say, the [indiscernible] effect, I can do a
cross-correlation between the CMB and our radio sources and measure what dark
energy was doing at 1.9 and 1.1. There ought to be bugger all if this
model's right. So in that sense it's a bad example.
But you see the principle. You don't actually need to know about every
galaxy. You use simple diagnostics like polarization or spectral index or the
radio K-z relation, or even a non-detection is giving you information. And so
I call this approach statistical redshifts, and we're also exploring this:
can we work out the redshift distribution and other properties of the
population as a whole?
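A toy sketch of the idea, with the talk's two median redshifts baked in; the
lognormal shapes and the polarized fraction are assumptions made purely for
illustration:

```python
import numpy as np

# Toy illustration of "statistical redshifts". The medians (1.9 polarized,
# 1.1 unpolarized) are the values quoted in the talk; the lognormal shapes
# and the 30% polarized fraction are assumptions for this sketch only.
rng = np.random.default_rng(0)
n = 100_000

def class_redshifts(median, size):
    """Draw redshifts from an assumed lognormal with the given median."""
    return rng.lognormal(mean=np.log(median), sigma=0.5, size=size)

polarized = rng.random(n) < 0.3
z = np.where(polarized, class_redshifts(1.9, n), class_redshifts(1.1, n))

# No per-object redshift was ever measured, but population-level tests
# (e.g. cross-correlating each subset with the CMB) can use these z's.
print(f"median z, polarized  : {np.median(z[polarized]):.2f}")   # ~1.9
print(f"median z, unpolarized: {np.median(z[~polarized]):.2f}")  # ~1.1
```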
Okay. Now we get to it. I've almost run out of time. Mining radio survey
data for the unexpected. Right. We're going to find things in our data that
we didn't expect to see there, we hope. Oh, by the way, if you didn't know
what WTF stands for, it's Widefield ouTlier Finder. This is the name of the
project and it's to mine the data for the unexpected.
So astronomy does not work by testing a hypothesis. You know, the old thing
we tell graduate students, what Popper said: you have a hypothesis, you
[indiscernible] test the hypothesis. That's fine. Works really well in
geophysics. Most of astronomy is not done like that. So look at the HR
diagram. Hertzsprung and Russell went out, they decided to correlate these
two quantities. They found the main sequence, and from that we found out
about [indiscernible] evolution.
And if you look at the Nobel Prizes in physics over the last few years, on
this plot, the black ones are the ones where people have just found something
unexpected in their data. The white ones are where people are testing a
hypothesis, à la Karl Popper. And you can see we actually have gotten 11
Nobel Prizes from stumbling across stuff, compared to 7 where people have been
testing hypotheses.
So in astronomy, most discoveries are made by going to large areas of
unexplored parameter space and seeing what's there. It's a voyage of
exploration, not testing a hypothesis.
Okay. So let's take an example. Jocelyn Bell, discovery of pulsars. What
happened? Jocelyn Bell, as part of a Ph.D., is looking at a new
[indiscernible] space. She's looking at high time resolution data and she's
looking for scintillation.
And she found these bits of interference, and she's told: focus on your
thesis, don't worry about that interference, it's probably a [indiscernible]
or something. But she did. She was bloody-minded, she was persistent,
obstinate, and she kept on following up these bits of scruff, and then she
found, yes, they are occurring at the same sidereal time every day. And so
she discovers pulsars and her supervisor got the Nobel Prize for it. I won't
comment on that.
Anyway, look at the factors that went into that discovery. Firstly, she was
exploring a new area of observational phase space. She knew her instrument
really well. She could tell the difference between rubbish on the detector
and something that was astronomical.
She's observant enough to look at the other things. She's open-minded; she's
prepared for discovery. That's important. Some people aren't. She's in a
supportive environment, where people are expected to make discoveries. And
she's bloody-minded and persistent.
Okay. So let's look at what we're going to do in EMU. Okay. We've got the
stacked parameter space. No problem there. Are the discoveries uniformly
distributed across the diagram? Well, Occam's Razor says, well, in principle,
they probably are. You've got no reason to say they aren't. So there should
be lots of good discoveries up there.
But what about the difficulty of finding them? Are they equally easy to find?
Well, no. Because up here, we get into large volumes of data and that's where
you get the problem.
So firstly, nobody's really going to be sufficiently familiar with the
instrument. We're at [indiscernible] arm's length; we're using these very
sophisticated software tools to analyze our data. So we will answer the
questions we're asking really well. You pose the question, we'll design the
software to answer it.
What we won't find are the things we're not looking for, the unknown unknowns.
So the question is, can we mine this data by looking for the unexpected? Let's
skip that slide. Well, one thing. If we don't try to do this, we're not
giving the maximum bang for our buck, and people like the NSF won't like us.
Well, we're in Australia so that's all right. ARC in Australia.
>>: We have a specific instruction against using things like WTF.
>> Ray Norris: You mean the acronym?
>>: The implication.
>> Ray Norris: Right. Okay. So we're actually planning a project called WTF,
where we are going to systematically mine the EMU database, discarding objects
that we already know about. Most of the things we find, of course, will be
artifacts. That's okay. They're good for quality control. We'll have the
[indiscernible] results, of course. And then we'll have a few new, genuinely
new classes of objects.
So we're in the process of building up, figuring out how we're going to do
it -- various approaches, decision trees. We've heard all of these today.
KFN is the opposite of KNN. And we'll probably end up using an ensemble
approach.
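One minimal sketch of the nearest-neighbor flavor of this mining, with
scikit-learn standing in for whatever the WTF pipeline actually ends up
using: score each object by the distance to its k-th nearest neighbor in
feature space, so the most isolated objects become candidates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Known classes live in dense regions of feature space (flux ratios,
# spectral index, polarization, ...); candidate "unknown unknowns" are
# the most isolated points. All data here are toys.
rng = np.random.default_rng(1)

known = rng.normal(0.0, 1.0, size=(5000, 4))   # toy "known population"
weird = rng.uniform(-8, 8, size=(10, 4))       # a few genuine oddballs
features = np.vstack([known, weird])

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(features)  # +1 skips self-match
dist, _ = nn.kneighbors(features)
score = dist[:, -1]                            # k-th neighbour distance

candidates = np.argsort(score)[-10:]           # most isolated sources
print("outlier candidates (indices):", sorted(candidates.tolist()))
```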
So EMU is an open project. Anybody here is welcome to join, and we'd really
appreciate help. If you've got ideas on how to do this, we'd love to have you
join the project. And we hope that these approaches will be useful for other
surveys. So later this year, I'm hoping to put out a data challenge if
anybody wants to have a go and see what we can dig out of the ATLAS data.
And I'll finish there.
>> Yan Xu: Thank you. Questions?
>>: There is a very slight conflict between the WTF program and the throwing
away of the UV data.
>> Ray Norris: Yes.
>>: Right?
>> Ray Norris: There is.
>>: So I mean, inasmuch as -- I just hate to give somebody running a project
like this, which is so awesome, any kind of advice. But if you could possibly
hold your UV data for as long as possible, more statistical samples of it or
[indiscernible] of it or something, it's really going to be valuable in your
following up.
>> Ray Norris: No, we'd really like to. There's the problem that
[indiscernible] 12 hours a day [indiscernible] takes about 12 hours, and it
needs a supercomputer. You're not going to download stuff onto your Mac and
play with it. Nevertheless, your point is right, yeah. Ideally you would.
It's just finances.
>>: You don't need the same time resolution, though. You can [indiscernible]
in slices.
>> Ray Norris: How do you know? How do you know you don't need --
>>: Compare your transient phenomenon to what was there before.
>> Ray Norris: Yeah. For VAST, that's true. If we're in a funny part of
parameter space, yeah. I'm very wary of making -- I mean, I sort of agree with
you, but I'm wary of making these assumptions. Quick question?
>>: Yes, just to conserve [indiscernible] and they have petabytes of hundred
thousand or something, but [indiscernible] petabytes.
>> Ray Norris: I think it's different between buying disks and having them in
a data center with RAID arrays and mirrors and the rest of it.
>>: It just adds up.
>> Ray Norris: Yeah, and the speed of access, of course.
>>: And then you've got to power it for the next --
>> Ray Norris: Oh, yeah, power.
>>: Okay. Let's move on to our next speaker. Chenzhou Cui.
>> Chenzhou Cui: So in my talk, I will give a brief overview of our work
during the last several years. Compared to the [indiscernible] and the
research here, activities in China are still in a very early stage. So I hope
you can understand our progress.
First, this is our understanding of the virtual observatory. The virtual
observatory is a data intensive online astronomical research and education
environment, taking advantage of advanced information technologies to achieve
seamless, global access to astronomical information.
So the China VO project is the national VO project in China. It was initiated
in 2002, just ten years ago, by the Chinese astronomical community with the
recommendation of Jim Gray, who became a member of the [indiscernible].
We mainly focus our efforts on the following fields. The first one is
construction of a China VO platform to provide unified access to online
astronomical resources and services. And we hope to collaborate with national
and international partners to make these products VO enabled. And we hope to
collaborate with astronomers, especially young astronomers, to use VO tools
and services to show the power of the virtual observatory.
And, of course, we will do our best to use VO resources to do public
education and outreach. We are very proud that during the last ten years, we
organized a nationwide VO workshop each year. During the last ten years, ten
workshops have been organized. You can see the attendance numbers. And since
last year, we limited the size of the workshop, but the workshop is so large,
it is hard to control, so it is [indiscernible] topics.
China VO is an active member of the IVOA. During the last several years, we
[indiscernible]. First, the IVOA small projects meeting in 2003. And in
2007, we hosted the spring Interoperability meeting of the IVOA.
After about ten years, the members of the China VO team come from nationwide
universities, observatories and information technology institutes. For
example, NAOC, Central China Normal University, Tianjin University and
Kunming University of Science and Technology are part of the China VO team,
and we collaborate with many international partners, including Johns Hopkins
University, Microsoft Research, Caltech, [indiscernible], ICRAR,
[indiscernible] and other partners.
As we said before, maybe yesterday, the basic goal of the virtual observatory
is to provide seamless, global access to data resources. So data access
services are also a basic task for the China VO. So [indiscernible] is the
Chinese astronomical data center; we hope to connect the nationwide astronomy
[indiscernible] in China and provide uniform access to astronomers.
Currently, we're hosting the following data sets, including several
[indiscernible] from Chinese telescopes. For example, the LAMOST pilot survey
data, LAMOST commissioning data, and, collaborating with Steward Observatory,
the South Galactic Cap U-band Sky Survey, and CSTAR, a small telescope
[indiscernible] operating at the Antarctic observatory. And in the coming two
or three years, there will be more data sets coming from Chinese telescopes.
Additionally, we mirror popular [indiscernible] from international
partnerships, including the CDS VizieR database, the Sloan SkyServer, and
2MASS. Just yesterday, I got the disk from [indiscernible], and very soon we
will set up a mirror site in Beijing for Chandra.
For the LAMOST pilot sky survey: LAMOST is a [indiscernible] spectroscopic
sky survey telescope, similar to the Sloan Digital Sky Survey. From last year
October to the spring of this year, this telescope observed about 300 plates
and will get about half a million spectra, most of them [indiscernible]
spectra. For this sky survey, we provided different data access interfaces,
including the web form, VO interfaces and command line interfaces.
You can [indiscernible] the observations and you can search just like using
the VizieR system, and there is a [indiscernible] of the spectrum. The output
can be displayed using [indiscernible] or TOPCAT. And it seems quite a part
of the [indiscernible] are selected from the Sloan Digital Sky Survey, so
[indiscernible] the Sloan data will be displayed. And you can submit your
query, an SQL query, to the database and get results.
And during the last several years, we developed several small tools. This is
a small converter for OpenOffice or open [indiscernible] to display VOTable
files. And there is a screen capture tool: if you're reading an article, you
can select some words, and then the keyword will be sent back to the
[indiscernible] VO database and you can get a result for the object.
And there is a small tool for database administrators to archive a lot of
[indiscernible] into a database. And based on the [indiscernible], we
developed a unified data access service for the virtual observatory. And
there is another one, a file manager for [indiscernible] files.
Based on the VOTable [indiscernible], we developed a plug-in for MATLAB, so
MATLAB can be used on VOTable [indiscernible]. We can use MATLAB as a
workbench for data mining on the [indiscernible]. Using the platform, we got
the first scientific paper from the China VO. We discovered a candidate Milky
Way satellite from the Sloan data.
And there is work from a partner, Tsinghua University, to use traditional VNC
technology, like a remote desktop, to integrate frequently used high energy
[indiscernible] packages on their server along with the popular data sets,
and then provide users access to the server to [indiscernible] work, just
like the concept of a cloud.
But this work was done about ten years ago. And there's a very popular --
important work in China, the e-VLBI network infrastructure. Currently, this
VLBI network is composed of four radio dishes: in Beijing, 50 meters; in
Kunming, 40 meters; and in Shanghai and Urumqi, 25 meters each. This is the
topology of the e-VLBI network in China. Four stations linked by the Chinese
science and technology network.
This is a multiple purpose, e-science oriented network used by Chinese
astronomers all at the same time. This network also serves deep space
missions for the Chinese government. For example, the Chinese lunar
exploration will use the network for orbit tracking.
And by the end of the year, there will be a new component: a 65 meter
telescope will join the network. And four years later, with the completion of
FAST, the 500 meter telescope, that will join this network too.
Chinese astronomers have done thorough GPU-based high performance simulation
work. This is an example from NAOC. At NAOC, we established a GPU cluster.
The name, Laohu, is Chinese for tiger. This cluster is used by students and
staff from NAOC and other Chinese and international collaborators.
The PI of the cluster is Professor Rainer Spurzem from Heidelberg University.
And the number of users of the cluster has been [indiscernible] 100 currently.
Various topics are completed on this cluster.
This is a hardware system for the cluster, and this is the performance of this
one. The total speed is about 157 teraflops, and we have 85 nodes, each one
with two NVIDIA Tesla GPU cards.
And the BOOTES project is a worldwide network of robotic telescopes, led by
the Institute of Astrophysics of Andalusia [indiscernible] in Spain. We
became a partner of this project last year, and we hosted the first telescope
of this project. Last November, we began to build the observatory. And just
before Christmas, we completed the hardware. And after the Chinese New Year,
we got the first [indiscernible], and in March, we organized an opening
ceremony.
So this observatory is a 60 centimeter small telescope, but with the hardware
and the software system, we can get rich information from this observatory.
You can see the outside view of the dome, and the inside view of the
telescope. These images are from our sky camera, one picture per minute. And
there are a lot of images from the full sky camera.
And you can observe the [indiscernible] in real time. And you can see the
observation log. All the information for the observatory can be got from the
console of the control software. So in China, we have strong requirements
for robotic telescopes, and astronomers show strong interest in time domain
astronomy.
You see, we have projects to build an Antarctic observatory and a Tibet
observatory. And we have a station in Argentina, and lunar-based astronomy,
and we have some international projects, for example the SONG telescope
network and SVOM with France. And for education and amateur observation,
robotic observatories are very useful.
So based on this experience and interest, we initiated our idea for a Chinese
Robotic Autonomous Observatory Network (CRAON) project. This is not to build
a specific telescope, but we hope to provide technical support and a solution
for Chinese users, and we hope to help astronomers and teachers to build
their robotic telescopes.
For the BOOTES-4 telescope, we hope to do some further development. For
example, to provide a VO-event triggered function for the observatory. And,
collaborating with our partners here, to develop a fully automatic
photometric pipeline and fully automatic archiving system and VO-compliant
data access, and maybe even, though it's very hard, automated event
classification.
In the next part, I hope to give you some more slides on our education and
public outreach activities during the last several years. This is a
broadcast of the total solar eclipse in the International Year of Astronomy.
We organized a large scale broadcast of the total solar eclipse. Along the
solar eclipse belt, we deployed 11 observation stations. And using satellite
and next generation network and IP-based networks, we provided signals to our
clients.
Finally, we got very good results. For example, CNN, ABC, AP, and this is
from Poland, yeah. And many Chinese TV stations used our live broadcast
streams. So our live broadcast was advertised at the front page of the IYA
2009. These are the results for this live broadcast. During the last
[indiscernible], we established a close collaboration with Microsoft Research
for the WorldWide Telescope.
In 2010, we organized a nationwide guided tour design contest, and about 200
tours were collected. The director of Microsoft Research Asia and the
president of the Chinese Astronomical Society attended the awards ceremony.
And during the last [indiscernible], we gave lectures and classes at
different places and different universities.
Two weeks ago, at the General Assembly of the IAU in Beijing, there was a
WorldWide Telescope booth, and many attendees visited the booth, and we got
favorable feedback from the visitors.
Limited by manpower and resources, it is impossible to teach students one by
one. So we train teachers and ask the teachers to teach their students, to
[indiscernible] our impact. So we organized several nationwide and regional
trainings for the WorldWide Telescope. And our last work before this workshop
is that we provided the WiFi service for the IAU General Assembly, because
the system of the commercial center could not provide so many simultaneous
WiFi connections. So, collaborating with one company, our team set up the
WiFi service. You see the log. This is the connection number and this is
bandwidth. You see the largest number of connections occurred on the first
day; the number is around 190. And you can see the peak bandwidth occurred on
the Wednesday of the second week.
So the scale is not so large, but this was our first time to test our team
setting up a system of this scale.
>>: Comment? I think the WiFi at Beijing is the best we've had at any General
Assembly yet.
>> Chenzhou Cui: Thank you very much. I'll stop my talk here. I'm so glad to
work with my colleagues, [indiscernible], three beautiful ladies. Thank you.
>>: So while the next speaker sets up, do we have one or two questions?
>>: I went and checked out astro box on the web, a MATLAB thing. It looks
like it's a commercial product that you have to pay for. Is that correct?
>> Chenzhou Cui: No, no, Open Source. We didn't provide the commercial
products.
>>: That's how good it is.
>> Chenzhou Cui: Maybe you visited a different website.
>>: Maybe so. I couldn't find a free download.
>>: So could I have ten seconds on the lunar observatory that I saw? Did I
see a lunar observatory?
>> Chenzhou Cui: Solar eclipse.
>>: Astronomy from the moon you listed.
>> Chenzhou Cui: Oh, lunar?
>>: Yeah, you had a slide on all the different telescopes. The robotic
telescopes.
>>: That's next year.
>>: Okay. I would support such a program.
>> Chenzhou Cui: Thank you.
Building all these telescopes, why don't you just buy some from the NSF.
>> Chenzhou Cui: Maybe next year, there will be a small telescope launched to
the moon from the Chinese --
>>: Oh, that's what you're going to do, is land a small telescope on the moon?
>> Chenzhou Cui: Very small, about 15 centimeters.
>>: What if it lands upside down?
>> Chenzhou Cui: I don't know. Well, next year, there will be the Chang'e 3
satellite launched. There will be a very small telescope, just for a test.
>>: We probably shouldn't run -- let's thank him. Our next talk is by Michael
Kurtz.
>> Michael Kurtz: I'm going to give sort of a case study of how we came to
where we are. This project is ongoing. We haven't finished some of the
fainter parts of it. It's another year or so. But I'll end with the current
picture.
My collaborators the whole time have been Margaret Geller and Dan Fabricant.
In the beginning, John Huchra. Around the middle and up 'til now, Ian
Dell'Antonio of Brown University, and for the last few years, Satoshi
Miyazaki of the National Astronomical Observatory of Japan.
Well, all right, let's start with 1986. That's what redshift surveys looked
like. That's actually not true: these dots are colored by the galaxy types,
and that data didn't exist for another four years. I made this plot in 1991,
as soon as it was possible, for, I think, SIGGRAPH, but it's appeared in very
many places now.
But certainly, at that time, we were thinking about what to do next. Also, in
1986, that's a picture of the Arizona mirror lab, and it was being conceived
then. It actually had already been conceived, and discussions were started as
to what to do with it. In particular, their goal was always to build eight
meter mirrors. Should they build a smaller one on the way that would fit
inside the MMT building? And pretty soon, the answer was yes, and it should
have a wide field.
So starting about 1986, we, of course, thought about doing one of these big
strips deep with a six and a half meter telescope.
All right. Well, now this is 1987, a year later. This conference in Garching
is actually very interesting from the history of this field. I spoke there on
classification methods. And at this point in the talk, I'm going to stop and
give an ad for our sponsor anyway, my sponsor, which is ADS.
ADS started at this meeting. I gave a talk in which I showed, among other
things, that you can take digital spectra, treat them as vectors and do
principal components and get a classification space.
Anyway, somebody saw the talk and came up to me and said that, in fact, that's
how you classify text. And that's the start of ADS. ADS didn't happen for
another six years, but it was being worked on from about that time.
So the ADS ad is that ADS now has probably 3 million full text articles.
Every article really ever published in a major journal in physics and
astronomy is in ADS full text now, and people who are interested in data
mining should see us, should figure out how to use it. It's probably the best
collection of scientific full text that exists anywhere in the world. And
we're extremely happy to have collaborators. We're building the APIs now for
them and we need people to help us, tell us how to build it.
All right. Well, back to the talk. In Garching, that's one of the paragraphs
I actually wrote in the proposal, in the paper. I'm talking about, you know,
how you classify voids and bubbles, which had only just been found at that
time. Two years before, nobody would have said that, because nobody knew they
were there. Anyway, in the last sentence or so I talk about the microwave
background fluctuations, which hadn't been discovered yet, and whether or not
there's going to be some correlation between these bubbles and voids and
them. That's, of course, what the BAO scale is, and that's been found and is,
indeed, part of what we're doing with the survey that I'll end with, HectoMap.
Well, all right. I don't know whether you can see any part of that, but it
gets to where the problems are in doing this. Back then, there were not
galaxy catalogs, really. The first map was done from the Zwicky catalog.
Zwicky, of course, looked at the POSS plates by eye and wrote it all down.
This is POSS 1 digitized. Look here. And look here. Here, you can see
something. And here you can really see nothing. But I'll show those things
several more times in the talk. So if you sort of remember what that looks
like.
The thing to do at that time was to digitize these plates. I met George while
I was at [indiscernible] digitizing plates, if I remember, about that time.
And so we digitized them. We went out to the MMT and observed them.
This is the map we finally came up with. That's several years of what we
could get from MMT time to do that. That was finished in the mid '90s or
something. The Century Survey is what that's called. That's where the
original slice is. This goes out to a redshift of 0.15, I think, out there.
Anyhow, that's what you could do back then. And, of course, we knew we were
looking for something better. Well, now let's go forward to 1991, three more
years, four years. And it had already been decided that we were going to
build this telescope. The glass had already actually been bought from Ohara
for the mirror. The spectrograph had been sort of designed. This is where we
figured out the algorithm for how to put down the fibers.
It turns out that we discovered you don't need such a huge angular motion in
the fiber positioner, which made it much easier to actually build the fiber
positioner. So that's all algorithmic work that was done getting ready for the
thing.
I showed that plot at the IAU in 1991. This is the abstract of the talk I
gave, which was again on spectral classification of stars. I was still
considered Mr. Automatic Stellar Spectral Classification at that time,
although I hadn't worked on it in, by then, nine years.
And here were a couple of things. First, the spectrograph didn't have a name
yet, I think. We thought it would be ready in 1995, four years later. That
it's still going to be 300 fibers, that's true. We thought you could get a
thousand classification quality spectra in an hour for bright objects. It
turns out we never built the grating for this, so we never really did stars
with this instrument. The hypervelocity stars project we do, but very few
other star projects.
But anyway, individual researchers can get on the order of 100 thousand
spectra per year, which will change the field. And, in fact, individual
researchers, small groups, can get 20 or 30 thousand, and that's basically
what I'm going to show at the end of the talk.
All right. Well, let's go ahead now three or four more years, and if you'll
remember, we thought this would be done then. Of course, it was still four
years from being done, four years later. So I did this, which is to describe
the whole process of how these things are done. You start basically by having
some large body of knowledge, databases and surveys, and then you build on
it. Somebody has an idea, a new technique is developed. You use it. It
works out. You get more and better. You know, the known is that way and the
body of knowledge is this way.
Basically, it works like this. You've got a body of knowledge. Somebody
invents something, does it. It succeeds. They come back, they do more of it.
Somebody else does some. The guys who did it build on that. Other people
build on it. And eventually, somebody comes around and builds a survey that
completely closes out that observational space. There's no real reason to
observe that stuff again. It's done. Like, you know, basic galaxies with the
Sloan. Right now, who would ever observe them again?
Time series, other things, it's a whole different dimension. See, it's this
whole dimensional thing.
>>: Check the theory, basically.
>> Michael Kurtz: Yeah, right. I hadn't thought of that, but that's true.
Anyway, so this is how I viewed what would be true. The Sloan survey hadn't
taken its first photon yet. In fact, none of these things had. What I
thought would be true somewhere, you know, five years ago now, so ten years,
15 years in the future from when I did it: there were all these deep probes.
The Australians had one, 2dF probably should have been that, and that's now
WiggleZ or something. But those names didn't exist back then. The Hectospec
survey, we thought, would be bigger and more done than it actually turns out
to be. DEEP existed, et cetera. The guys at ESO have several surveys now,
none of which were even conceived at the time I built this diagram.
But anyway, all right, so that's sort of what we were doing then. And now
we'll get back to the problem at hand, which is how do you go and feed a six
and a half meter telescope sky survey for spectra? Well, this is now the
POSS 2. Yeah, this is probably unseeable, but you can see there and here,
they're the two regions I'm trying to have you look at. And if you could
actually see this, it looks a little better than it does on the POSS 1.
>>: White stars on black sky, you call yourself an astronomer?
>> Michael Kurtz: Well, when I made these slides, it was for a dark room.
Besides. Anyway, the nice thing about this is that somebody else had a
measuring engine and made a catalog. So we didn't really have to go and scan
them anymore ourselves. And because of that, and because we now had CCD
spectrographs instead of Reticons, a 60-inch telescope could make this
survey. This not only goes out to a redshift of a tenth, it's the same
region. What did we call this? 15R. It goes out to 15 in big R.
Well, all right, so that's what we sort of did. We're still looking to do
these slices. We're still thinking about how to do it when we have the six
and a half meter telescope. But it's still four years away.
So what do we do? Well, we use that data to simulate what we'd have when we
had the big telescope, of course. And one of the things you have to do is
figure out how to automate the reduction of the data, if you're going to have
so much data. And this is our blunder diagram. This is the actual data from
the survey I showed you, plus other regions of the sky.
And this is the blunder area. Every dot there is something that you would not
get automatically by the most stupid procedures. And, indeed, we had to fix
these. This turns out to be objects where nitrogen is brighter than H-alpha.
These up here, they can all be understood, but the algorithms had to be
changed so these things all went away.
That one was a difficult one. And this one turns out to be two different
galaxies on top of each other. It's real. I won't explain what this diagram
is, how exactly I got it, but the blunders are up here and you need to do
stuff like that.
Well, all right. Three years later, four years later, they still -- it
wasn't four years away then. It was only three years away. So what did we
do? We decided we'd model the sky subtraction and improve it. So we improved
the sky subtraction. This shows that for normal things you can get 80 percent
completeness three magnitudes below sky using just very simple techniques,
which is what we would actually have with the survey that we were planning.
Okay. Now, three years later, it was ready. Three years later, the telescope
was built. We were now in 2003. So we've gone forward in time now 15 years
from the beginning and eight years from when it was originally supposed to be
done.
Well, this -- that's the galaxy, those are the galaxies I was showing you,
and that's the galaxy that you could never have seen on the other ones. This
guy is bright enough to be in the survey, these guys all are, but that guy is
not. To show you what I'll eventually show you, so you see what it is.
But anyway, we never got it together to do the drift scans, to go and get our
own catalog to do this survey. Because it was so late, there were whole areas
of the space that were taken by other people. DEEP existed at that time. The
Sloan was already taking data at that time -- we'd have beaten them if we'd
been on time, but that's true with everybody.
So what we did is we got this as part of the Deep Lens Survey. This is five
by three arc minutes or so, the Deep Lens Survey. This region is four square
degrees, and we had competitors: we had Arizona and CfA, who were doing
something with the NOAO deep field. So basically, we didn't do the survey at
that time that we were planning.
But we did this instead. We did this because we were interested in weak
lensing, and looking at the weak lensing properties in the Deep Lens Survey
compared to the spectroscopy, deep spectroscopy.
All right, well, I won't show you any of the results, but this is a noise
diagram. This is a measure of signal to noise. And this is the brightness
of the galaxy down the fiber. It turns out that 21 down our fiber in big R
is almost exactly the same as 21 fiber magnitude for Sloan little r.
So at that level, in an hour, you basically get everything. One of the
things about this is that almost all these things down here were taken with
the moon up, and we were figuring out exactly how much moonlight we could
have and still do the survey.
The next thing that happened is the Sloan DR5 came out, which had this
region. What this meant for us is we didn't have to collaborate with the
people who were taking the images. We could just do spectra of regions that
we wanted to. Here, you'll see the little guys here. This guy is perfectly
visible in the actual Sloan. That guy is fainter than you're going to see.
That one's about 20.2 or 21.2 or 21.3. The Sloan is perfectly deep enough
for a six and a half meter telescope for almost everything you'd really want
to do. It's not deep enough for four, five hour integrations. But for our
integrations, everything is in the Sloan.
So this freed us from having to deal with collaborators who do the imaging.
Okay. So the Sloan had colors. So we did this. This basically is a
photometric redshift paper showing that you can do surface brightness versus
color. This is color percentiles as a function of surface brightness. These
are the red ones. Goes that way. That's the bluest ones. And it works pretty
well. You don't really need several colors. A couple just work fine. This is
just R and I versus R surface brightness. You get rid of most of the outliers
if you have G.
All right. Well, we finished that and thought of what to do next. The
lensing project worked pretty well. So this is a lensing map done by the
Japanese on the [indiscernible] telescope, and we figured we'd just observe
that region. We did. And we didn't have to ask them. We didn't have to do
anything. They published their results already. We knew where their peaks
were.
We, of course, made the mistake of asking them, and now we have more
collaborators, and I'll show you that in a moment. But the nice thing about
the Sloan is that we can just do this. None of this had to be arranged in
advance. We could call them up and say, we have a redshift survey over your
region, do you want to collaborate? Not: do you want to collaborate, give us
stuff. And that works a lot better.
Okay. So this is the survey itself. It's a film. This is the old CfA
redshift survey from 1986. It will disappear in a moment. This is 0.5 in
redshift. That's the Sloan. This is what we did with the Japanese field,
and now coming in -- this is sped up in time -- that's the first year of
observation after the Sloan. So 2009, I think. Then '10, then '11, then
'12.
And that's the final map, or will be very shortly. Here it comes. This is
amazing if the film runs to the end, which it doesn't in PowerPoint, which is
the new buzz word for the Smithsonian Institution. This film is actually
going to be shown at the Air and Space Museum real soon now. It may actually
already be, but we only made it last week.
Okay. So to conclude, that's Hyper Suprime-Cam, and it turns out Hyper
Suprime-Cam is now going to, as its engineering time, image the strip that I
just showed you. So things come around. Now, the guys doing the photometry
are doing the photometry in the region where the spectroscopy is already
done. So that's all I wanted to say. Thank you very much.
>>: Can the next speaker come up while we take a few questions?
>>: So what area do you actually cover and how many objects have you got
spectra --
>> Michael Kurtz: It depends on the deep or the shallow survey. The shallow
survey is 50 square degrees. The deep survey is 35. There are currently
about 60,000 spectra. We'll go up to about 75 or 80 thousand.
>>: How many bands?
>> Michael Kurtz: They're spectra.
>>: Oh, spectra.
>>: Is the data public?
>> Michael Kurtz: No way.
>>: Are you going to put it into --
>> Michael Kurtz: No.
>>: Some cloud --
>> Michael Kurtz: Absolutely not.
>>: Why not?
>> Michael Kurtz: Screw that. It's ours. I know that I'm making fun of
this, but, in fact, we're a small group and we really can't afford it.
>>: Okay.
>> Michael Kurtz: The first data is being, you know, managed now and
published now. But we don't have anybody to actually deal with it. The
person who, say, would get together a data paper is me. I'm busy; it's just
not on my list of priorities.
>>:
Have somebody write your paper for you.
>> Michael Kurtz: To some extent, I agree. But I'm not all the people in the
collaboration either. So of everybody, I'm the most public one. And it's not
going to happen.
>>:
Let's thank our speaker again.
>> Nicholas Ball: All right. I see we're running a little late, so I'll try
and finish relatively quickly. So I'm Nick. I'm from the Canadian Astronomy
Data Centre, now well known as the source of 39 million Sloan queries back in
2008.
>>: Was it you?
>> Nicholas Ball: All I can say is that it wasn't me. So these are my
collaborators: David Schade, head of CADC; Alex Gray, who will be here
tomorrow; Martin Hack, who is also involved in Skytree; and other people.
So just to give an idea who I am: you can divide people up into data miners
who do astronomy and astronomers who do data mining. I am an astronomer who
does data mining. So I will talk about CANFAR. Then I'll talk about
Skytree, then I'll talk about CANFAR and Skytree, give an example of using
it, an example of science, if there's time, and conclusions.
So CANFAR is CADC's cloud computing system. It's pretty unique in being a
cloud computing system for astronomy. The idea is that it provides a generic
infrastructure for processing of, for example, survey data, and it also
provides storage.
The basic specs are there: 500 processor cores, up to six processors and 32
gigabytes of memory per node, and that's going to increase to 256 soon.
Storage is provided by the VOSpace system, and there are several hundred
terabytes of storage available. A key point is that this storage can be
accessed as a mounted file system, so it's just the same as having another
directory on your machine.
When a user uses CANFAR, they see a virtual machine, and you just see it as
another terminal. So you SSH to CANFAR. Once you're there, you can install
and run any software that runs on the Linux system, and that's essentially
all astronomy code. And that's a very important thing. CANFAR was set up so
that astronomers can run their own code, which a lot of astronomers like to
do.
And once you've set up your virtual machine, you can then run that
interactively if you'd like, or you can run it in batch via Condor, and batch
is good because that gives you access to 500 cores. So that's CANFAR in a
nutshell.
Skytree is a software system that is designed to be the first large scale
machine learning software that is industrial strength, essentially. Their
philosophy is, rather than try to implement thousands of data mining
algorithms, they implement seven well known algorithms and they do them well.
And a lot of data mining algorithms can be reduced to these. Maybe reduced
is the wrong word, but there are aspects of these algorithms in many others.
The key thing about Skytree is that it's fast. The implementation is always
scalable, so things that are naively N-squared, for example nearest
neighbors, scale linearly. The system is robust and there are published
papers showing the accuracy and the speed-ups of the algorithms. This comes
out of the FASTlab, of which Alex is also head, which is academic. The key
point is that it allows for publication quality scientific results.
So as I say, it has an academic and astronomy background. In a sense of an
astronomy background, the sky in Skytree stands for the sky, and tree stands
for the data structure.
It's designed to work on the command line, and it's designed to be called as a
part of a more general analysis. So it is well suited to CANFAR, because
CANFAR is a batch processing command line system.
And you input for example, ASCII data, and you visualize the results elsewhere.
>>: Is the primary a library that you call from C code?
>> Nicholas Ball: Yeah, it's a machine learning server. So yeah, you call it
from whatever code, or you can just run it interactively from the command line
itself. So it's designed to provide -- it's essentially, you can almost think
of it as like Unix commands, except now you have machine learning commands
instead. So it's designed to be part of your analysis.
So these are the algorithms that it implements. I'll just do this quickly.
AllkNN is all nearest neighbors, and it goes from N squared to N. It does
kernel density estimation with the same speed-up; this is N objects.
Two-point correlation function, again. And then the speed-ups here are more
complicated, but you could have things like D-cubed to D, where D is the
number of dimensions, and support vector machine is more complicated. But it
implements SVM, linear regression, singular value decomposition, which
includes PCA, so it's dimension reduction, and also K-means clustering.
So CANFAR plus Skytree: this is a powerful system. CANFAR on its own is
powerful. It's like having a supercomputer. Skytree is also powerful. And
when you combine the two, you get a good system to work with.
As I said, you can install your own code on the virtual machine, and you have
access to VOSpace as a mounted system. So this essentially follows the
paradigm of taking the analysis to the data, because the analysis is in
CANFAR, which is in CADC, mostly. And the data is on VOSpace.
So the argument for having Skytree in this way is analogous to the argument for CANFAR itself. CANFAR itself is a generic infrastructure that saves you having to reinvent the wheel to do your own data analysis. And Skytree does the same thing with respect to performing data mining on large data.
So I'll just show you quickly, as an example, say you want to actually use it, what do you do? To start with, you request a CANFAR account. You also get a CADC account, which is trivial; it's just like any other website, login and password. Then once you have your account, you SSH over to CANFAR to create a virtual machine. You VM-create the virtual machine, then you go through it, and then say you want to install some software. If you want to install Skytree, you just download it and untar it. And the only other thing you have to do to run Skytree is set up a license server, so that you can then run it in batch in multiple instances.
A typical call to run Skytree looks like the following. It's just a command line with arguments. So here, we're running the nearest neighbors. This is the file that's going in. It's an ASCII file, and it's slightly modified into the Skytree format. But that is all very simple. And if there isn't an existing script to convert your data type, then they'll help you write one. And then there are just some arguments here and then some outputs. And in this case, it's outputting the distances to the neighbors.
Then you do whatever with your results. And you can run this entirely interactively on your single virtual machine, or you can run it in batch, up to 500 at once.
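[The general shape of such a run, driven from a script; every name and flag below is an invented placeholder, since Skytree's real command-line syntax is not reproduced here.]

    # Hypothetical sketch of wrapping a command-line call in Python.
    # "skytree-server", "allkn" and all flags are placeholders, not the
    # tool's actual syntax.
    import subprocess

    cmd = [
        "./skytree-server", "allkn",   # placeholder binary and task name
        "--input", "galaxies.st",      # ASCII table in the tool's format
        "--k", "10",                   # number of neighbors
        "--output", "neighbors.out",   # distances to the neighbors
    ]
    subprocess.run(cmd, check=True)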
So a key part of this is that you don't just work with a virtual machine. You
can also work with us, the people. The aim of the system is to enable better
science. If you have a problem and you maybe want to see how to solve it with
data mining, if you can, then we'd be happy to work with you to do that.
And my background is astronomy, so if you send me an astronomy email, I'll probably understand it. And I can help suggest data mining. A key point, which I should have put on here as well really, is that if you maybe need something more advanced in terms of algorithms, then Skytree and the FASTlab are world experts in machine learning. So if I don't know the answer, maybe they will, and they can be part of a consultation too, because both the FASTlab and Skytree are very keen to work with astronomers. And as I say, people like Alex have been working with astronomers for 20 years now.
So it looks like I have time to do this quickly. So my science interest is galaxy luminosity function, large scale structure as well. An interest in this is if you look, for example, in Eric's SCMA book over there, the first chapter is new estimators for the galaxy luminosity function, so there's a whole lot of astrostatistics interest in this as well. And I think there's a potential there.
But in this case, if you want to do the luminosity function, you can consider doing Photo-Zs, and if you make Photo-Zs, they have great legacy value, so that's what we've been doing. Skytree nearest neighbors allows you to produce Photo-Z PDFs, and that's detailed, not with Skytree but with the same method, in this paper. And because CANFAR lets you run any software, you can run the template-based codes on 500 cores as well. Like many things, the consensus is that Photo-Z is best done using more than one method and comparing them. So that's what we're doing.
And we did this; we ran the neighbors for the CFHTLS survey, 130 million galaxies. And the way we generate PDFs means we generated a hundred instances of each galaxy, and we did that because it enables us to demonstrate creating a catalog of billions of objects, storing and handling that catalog, and then performing data mining on that catalog. In this case, nearest neighbors to get Photo-Zs.
So we have processed an LSST-sized dataset. And that's no small thing, because you're sitting there with hundreds of cores, virtual machines, VOSpace, and it all actually works. So it's quite nice, and that's just an example of a Photo-Z PDF, showing that it's general; it's not Gaussian or anything.
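[A toy sketch of the perturb-and-reuse idea just described: each galaxy's magnitudes are resampled a hundred times within their errors, and the nearest training neighbor's spectroscopic redshift is collected for each instance. The arrays and error model are assumptions for illustration; this is generic Python, not the production code.]

    # Toy nearest-neighbor photo-z PDF for one galaxy.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    train_mags = rng.random((50_000, 5)) * 5 + 18   # stand-in training set
    train_z = rng.random(50_000) * 2                # known redshifts
    target_mags = np.array([20.1, 19.8, 19.5, 19.3, 19.2])
    target_errs = np.array([0.05, 0.04, 0.04, 0.06, 0.08])

    tree = cKDTree(train_mags)
    instances = target_mags + rng.normal(0.0, target_errs, size=(100, 5))
    _, idx = tree.query(instances, k=1)
    pdf, edges = np.histogram(train_z[idx], bins=50, range=(0, 2),
                              density=True)
    # "pdf" is a (generally non-Gaussian) photo-z PDF for this galaxy.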
So another thing we wanted to show was that Skytree scales, as they claim, on real data. You really get N-squared to N. And another thing, this is in
progress, but we also want to compare it to Open Source alternatives, for
example, R. So we're looking at running various algorithms on various large
catalogs and that's in progress. You can tell we're Canadian here, catalogs.
So, for example, those, and we want to do -- you can do useful things with that. For example, finding outliers. So I could add to this list now, of course, EMU, which we heard about earlier. 70 million. That's the same sort of size when it's done.
So a couple of quick plots just to show what's in progress. This is run time versus fraction of a dataset, the dataset being 500 million objects. And you can see that for the points we have, the run time is, indeed, linear. We actually produced error bars on these points, but the error bars were smaller than the points. So that's that. And then the next plot shows that at the moment, just doing this is memory limited. But once we have the 2.56 gigabytes, you can see we can clearly run this sort of size dataset and run neighbors or whatever on it, and so what this has done is it's found the nearest neighbor to every object in the dataset.
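[The check being plotted can be sketched like this: time the neighbor search on growing fractions of a dataset and look for linear growth. A generic illustration, not the actual benchmark.]

    # Time nearest-neighbor runs on fractions of a dataset.
    import time
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(2)
    data = rng.random((1_000_000, 3))

    for frac in (0.25, 0.5, 0.75, 1.0):
        subset = data[: int(frac * len(data))]
        t0 = time.perf_counter()
        tree = cKDTree(subset)
        tree.query(subset, k=2)   # nearest neighbor of every object
        print(f"N = {len(subset):>9,d}  t = {time.perf_counter() - t0:.2f} s")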
So to conclude, CANFAR allows storage, processing, and analysis. It's generic,
and you can run your own code. Skytree is fast and robust, and it will allow
publication quality results. CANFAR plus Skytree implements Skytree on up to
500 cores. You can combine this analysis with your own code, and you can store
your data in VOSpace. And that's persistent storage. You can put it there and
you can even have proprietary data there as well with permissions.
So if you're interested, there are lots of ways to get started. Email CANFARhelp, talk to me, sign up by the poster, look at my website, whatever. If you want to look at the website, I designed this page to be -- there's a single go-to CANFAR-for-Skytree page that has all the information and all the relevant links on it. So if you're interested, that is the best place to go. Or just have a look at the poster.
And we are very much encouraging anybody who is at all interested in this to
have a go with it and use it, because it becomes much easier to justify funding
it if there's a base of users. And on a more personal note, thanks.
>>: And I might add, the only speaker to stick to time.
>>: Sounds like an outfit too good to be true. You have storage space, computer resources, algorithms and expertise so you can use it. So essentially, like in no time flat you're going to be besieged with requests. How are you going to manage resource allocation when that happens?
>> Nicholas Ball: Well, for a start, it will be a nice problem to have. So far, we don't have many users. In terms of managing the resource allocation, well, jobs are queued by the Condor system, so they're queued in a fair way, the same way as on supercomputing systems.
If we become inundated by users, then it should be much easier to justify
funding to expand the system, because the processing is already distributed
over more than one site, as is the storage, so the whole system is expandable
well beyond 500 cores. And if there's a clear demand for it, then expanding it
should be possible but I don't know the details of that.
>>: Can I start running another Millennium Simulation there?
>> Nicholas Ball: Yeah, in principle. So the way it works is that projects
with Canadians are guaranteed access because of the way it's funded. Projects
with no Canadians are done on a case-by-case basis. So far, we haven't said no
to any astronomers. So if anybody here is interested, I don't see a problem.
>>: You should advertise it on maybe the Facebook pages for astronomers or astroinformatics.
>>: I recommend he not advertise it.
>> Nicholas Ball: I want to do whatever it takes to get a base of users.
>>: Are you planning on adding other techniques, besides the ones you listed?
For example, for nearest neighbor, something like [indiscernible] type things
that have variable numbers of neighbors and so on?
>> Nicholas Ball: Yes. So there are a few things listed on the poster that Skytree is hoping to put out later this year. Things like decision trees, probably neural nets. Time series streaming is something they're working on, and several other things.
They basically say that if there's an algorithm that has a lot of demand, then they'll add it. Things like the neighbors already have a few refinements, like different distance measures, I believe. I don't know if they have [indiscernible]; I don't think so. But if it's something that a user is interested in, they would be very amenable to adding that.
>>: And if I have my home brew stuff I want to do with the output in MATLAB,
would that work? Would you support MATLAB?
>> Nicholas Ball: If MATLAB requires a license, I don't think we could pay for a MATLAB license. But if you could -- I guess if you could pay for the license, then you could install it on CANFAR and run it.
>>: The license is beyond control.
>> Nicholas Ball: That was actually a significant thing here, the fact that
Skytree is licensed as well, and it just seems to work fine on 500 cores.
>>: About the [indiscernible] data. Skytree has its own format. A couple of related questions. One is whether Skytree would be [indiscernible] so that it would interoperate with other VO service providers. But if you have a hundred million object catalog, you don't want to upload it to a VO service, so how do you get large datasets into the system? Have you thought about caching those datasets so others could simply access them from your data stream?
>> Nicholas Ball: Yes, so the way I've done it is basically with CSV files and the Skytree format. For some of the algorithms you can put the CSV straight in, because they don't need any more description. For some of them it's just headers saying this column is a double, this column is a whatever, and it's still ASCII, so it's relatively simple. You could even do it with an awk command, almost.
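[A sketch of how light that conversion can be; the typed-header layout below is a made-up placeholder, not Skytree's real format.]

    # Prepend an invented typed header to a plain CSV file.
    import csv

    types = ["double", "double", "double", "int"]   # one entry per column

    with open("catalog.csv") as src, open("catalog.st", "w") as dst:
        reader = csv.reader(src)
        names = next(reader)                 # original column names
        for name, t in zip(names, types):
            dst.write(f"# {name} {t}\n")     # hypothetical header lines
        for row in reader:
            dst.write(",".join(row) + "\n")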
But yeah, in terms of caching data, you could certainly store the data on VOSpace and allow whichever people access to it. I don't know if there's a faster system than just the disk that the files are stored on in VOSpace, so you'd be accessing it off a disk, but that's fundamental to the CANFAR system. But yeah, you should be able to create whatever data and allow people to use it.
>>: [inaudible].
>> Nicholas Ball: If there was interest in being able to read VOTable, yes, I would have thought they'd be willing to do that.
>>: [indiscernible]. Moving the algorithms to the data, because they are
still moving the data to [indiscernible].
>> Nicholas Ball: Yeah.
>>: And not moving [indiscernible].
>> Nicholas Ball: Yes, your system is --
>>: This is a simple question. When you say that if the user wants to solve a specific problem he can contact CANFAR or me, does it mean that [indiscernible] help only if it means you [indiscernible] on the paper? Because it always happens in this case.
>> Nicholas Ball: So I don't really have many examples to go on so far, but yeah, it would be -- I would be some kind of -- I would be a consultant, essentially. So I would assume whether I would ask to go on the paper depends on how much I contribute.
>>: Same problem [indiscernible] find a solution.
>> Nicholas Ball: Yeah. So yeah, it's a sort of -- this is -- my trying to
sell myself a bit is kind of related to this, right.
>>: But it's a standard problem, right. That part of doing all this is to
parade your skills to the community. Maybe you can contribute to lots of
different projects.
>>: [inaudible] for instance the wonderful work done by [indiscernible] has begun to [indiscernible]; people are beginning to note that statistical information, data mining, is the same. The only [indiscernible] is that people want to use these things without even making the effort to understand what a neural network is or what a decision tree is. So at the end, they come up with the most absurd questions, and they are upset if you don't give an immediate answer. You say, look, if you don't even know the basics, I cannot even start the [indiscernible] for you.
So this is what I'm asking: the [indiscernible] must be very careful; otherwise, they would be submerged in people's questions.
>>: Become a help desk.
>>: They do not expect to even give you anything in exchange.
>> Nicholas Ball: Yeah, I mean, so far we're just trying to get people to know it's there.
>>: There was a paper on astro-ph a month or two ago. I saw one on CANFAR.
>> Matthew Graham: -- we've been doing at Caltech, which is related to some of the other stuff that's been talked about today, which is we've been playing around with looking at automatic discovery of relationships in astronomical data using a tool that's come out of Cornell, [indiscernible] group at Cornell, called Eureqa, or it used to be called Eureqa. It was the subject of a Science paper in 2009. It's now called Formulize. [indiscernible] is part of the same group, but doing something different.
But the basic idea is the use of the technique of symbolic regression to
determine best fitting functional forms to data, and it does this -- it does a
combined search for both the best form of an equation and then the parameters
of that equation simultaneously to fit the data.
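[A toy sketch of that idea: an evolutionary search over random expression trees, scoring each candidate by its fit to the data. Entirely illustrative; this is not Eureqa or Formulize code, just the general technique in miniature.]

    # Miniature symbolic regression by evolutionary search.
    import math, random

    random.seed(0)
    OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b}
    FUNCS = {'sin': math.sin, 'cos': math.cos}

    def random_expr(depth=3):
        if depth == 0 or random.random() < 0.3:          # leaf node
            return 'x' if random.random() < 0.5 else random.uniform(-2, 2)
        if random.random() < 0.7:                        # binary operator
            op = random.choice(list(OPS))
            return (op, random_expr(depth - 1), random_expr(depth - 1))
        return (random.choice(list(FUNCS)), random_expr(depth - 1))

    def evaluate(e, x):
        if e == 'x':
            return x
        if isinstance(e, float):
            return e
        if e[0] in OPS:
            return OPS[e[0]](evaluate(e[1], x), evaluate(e[2], x))
        return FUNCS[e[0]](evaluate(e[1], x))

    def mutate(e, p=0.2):                # replace random subtrees
        if random.random() < p:
            return random_expr(2)
        if isinstance(e, tuple):
            return tuple([e[0]] + [mutate(c, p) for c in e[1:]])
        return e

    def mse(e, xs, ys):
        try:
            total = sum((evaluate(e, x) - y) ** 2 for x, y in zip(xs, ys))
        except (OverflowError, ValueError):
            return float('inf')
        return total / len(xs) if math.isfinite(total) else float('inf')

    xs = [i * 0.1 for i in range(100)]           # toy data:
    ys = [math.sin(x) + 0.5 * x for x in xs]     # y = sin(x) + 0.5*x

    pop = [random_expr() for _ in range(300)]
    for _ in range(40):                          # crude evolution
        pop.sort(key=lambda e: mse(e, xs, ys))
        elite = pop[:30]
        pop = elite + [mutate(random.choice(elite)) for _ in range(270)]
    best = min(pop, key=lambda e: mse(e, xs, ys))
    print(best, mse(best, xs, ys))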
Now, you can specify the type of building block you want to use in the fit in terms of mathematical building blocks and algebraic operators and other functions, [indiscernible], that sort of thing. And then there are more sort of advanced building blocks that you can use.
And what it actually does is it uses an evolutionary algorithm to explore a
metric space of numerical partial derivatives of your dataset or actually
variables in your datasets and it's looking for invariants. And the idea is it
runs through -- it's a genetic algorithm, an evolutionary algorithm, and it
produces a final small set of candidate analytical expressions which are the
best fit according to some metric that you've supplied, whether it's goodness
of fit or absolute error or something. And it supplies a set of final candidate expressions which vary from, you know, this one uses a small number of parameters and so is less complex, but maybe doesn't have as good accuracy, to something which is more complex, because it may use more complicated functions or a higher number of functions, and is more accurate, on this Pareto front of accuracy versus parsimony.
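[Picking that Pareto front out of a set of scored candidates is itself only a few lines; a generic sketch, not Eureqa's implementation:]

    # Keep the Pareto front of (complexity, error) pairs: candidates that
    # no other candidate beats on both axes at once.
    def pareto_front(candidates):
        """candidates: iterable of (complexity, error, expression)."""
        front = []
        for c, e, expr in sorted(candidates):   # ascending complexity
            if not front or e < front[-1][1]:   # must improve on error
                front.append((c, e, expr))
        return front

    cands = [(3, 0.90, "a*x"), (5, 0.40, "a*x + b"), (5, 0.55, "a*sin(x)"),
             (9, 0.38, "a*sin(x) + b*x"), (12, 0.10, "a*sin(b*x) + c*x")]
    print(pareto_front(cands))   # simplest candidate at each accuracy level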
So go to this slide, if you're interested in the software. Unfortunately
they've just changed it so that it's licensed above a basic version. But they
seem to be quite amenable to discussion.
So the sort of thing we've been playing around with: first of all, can we reconstruct relationships in the Hertzsprung-Russell diagram if we just put in our dataset absolute magnitude versus color, surface gravity, temperature and metallicity, as you might find from a spectroscopic sample of stars or whatever? Can we construct relationships?
What we find is if we train it up on the blue data points, which are around about three and a half thousand stars from [indiscernible] which give a good coverage of the HR space, and then we apply it to some other data, we get some very good solutions on them, where we get a median difference between the [indiscernible] and our [indiscernible] value of about 0.6 in MV. We get similar errors when we apply it to SEGUE data, which is the red stuff, and the black points are from the RAVE DR3 dataset.
So we're not overfitting the data in any particular way. It seems to be
finding realistic relationships in the data in terms of fitting functions to
the data.
Similarly, the right-hand side is a plot of the fundamental plane of elliptical galaxies. This is the original data, [indiscernible] '87. And again, you put in the data -- you know, these are the values; can you find any relationship in the data? So it goes ahead and does that and reproduces the values from that.
Possibly the more interesting one is that you can phrase the search for a
fitting function in such a way that it actually becomes a binary classification
operation.
So these are the RR Lyrae and W UMa data sets that Sherri was showing earlier. Obviously, from the light curves you can't tell anything different between them. So what we do is characterize them in terms of about 60 different features, and then that forms the dataset that we put into Eureqa, and it goes through and comes up and says the best fitting function which allows you to differentiate between these involves just the period and, in our case, the median absolute deviation, and it ignores the other 58.
So it's doing feature extraction as well. It's not using all of the 60 features. It only picks those which are absolutely relevant. And when you plot it, you can clearly see that there is very clean separation in terms of these, and that's because this is a section [indiscernible] diagrams of these two classes of object and there's a clear separation.
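[For reference, the surviving second feature is one line to compute; a generic sketch of the usual definition of the median absolute deviation:]

    # Median absolute deviation of a light curve: median of |x - median(x)|.
    import numpy as np

    def mad(flux):
        flux = np.asarray(flux, dtype=float)
        return np.median(np.abs(flux - np.median(flux)))

    print(mad([1.0, 1.2, 0.9, 1.1, 5.0]))   # robust to the outlying point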
So in these cases, we're getting purity and efficiency equivalent to the
decision tree stuff that Sherri was showing. Similarly, when we look and see
these blazars and [indiscernible] against [indiscernible], we get similar
results as well.
So it's interesting that this technique, which was originally designed for sort of looking for arbitrary relationships in the function space, can then be used for doing feature selection, feature extraction for purposes of binary classification.
So there should hopefully be a paper out on this soon. That was my quick five minutes.
>>: Is the complexity penalty more than just [indiscernible], or is it more something like a Bayesian?
>> Matthew Graham: I don't think so. The complexity factor does have some
effect on which solutions are selected, because you can weight different
functions with a different complexity value. And that does fall in, in some
ways.
>>: There's no penalty on complexity?
>> Matthew Graham: There's no penalty on complexity, right. You can impose an
arbitrary penalty if you want, but you don't want [indiscernible] particular
complexity.
>>: [indiscernible] but going to a search space, did you find the complexity
of the relationship [indiscernible]. The fitness function doesn't have that?
>> Matthew Graham: No, the fitness function doesn't, but you could filter it on --
>>: You had a bullet about Pareto parsimony. Isn't that just another word for complexity?
>> Matthew Graham: Sean could probably answer that.
>>: All right. I'll start. Apparently, this is not working. We're supposed
to discuss here sharing of knowledge discovery tools, and I'd like to show the
slide here to show that at the moment, this responsibility of sharing is not a
gentle, kindly thing where more or less equal friends share things that are
more or less equal. It is extremely unequal.
The people in this room know the meaning and use, often, the meaning of what's
on the left hand column. And Google shows that there are millions, or hundreds
of thousands of activities in here, in the world, in the industrial world, the
statistics world, engineering world.
And the right-hand side column shows that the astronomers are roughly five
orders of magnitude behind, okay. Now, there are not as many astronomers. We
don't expect hundreds of thousands of hits. But it would be nice to see
hundreds of hits or even thousands of hits on occasion. You have to remember
that astronomers --
>>: Those numbers, they're not right.
>>: I got them last year from ADS using --
>>: No, no, I can give you more data.
>>: Very good.
>>: Data mining has 3,000 hits.
>>: Sorry?
>>: Data mining has 3,000 hits in ADS as a phrase. If you don't make it a phrase, it probably works.
>>: It's an abstract.
>>: If you're looking at abstracts, I just did it. The number is 2,944.
>>: Something was wrong, I do apologize. It seemed to have been rather poor,
and I'm glad to hear that it's not as poor. Thanks very much. The results of
this, in my opinion, is that -- try random forest.
>>: Do you want it as a phrase?
>>: I only do refereed papers.
>>: All right. So random forest, refereed papers, as a phrase.
>>: Okay. So it's my impression that it's rather low. And I think what this
means is that we have more than sharing to do. We have a great deal of
education to do and a great deal of promulgation to do.
Now in statistics, astrostatistics, astroinformatics or sort of sister-related fields, the situation is more or less the same. The median astronomy paper in statistics uses a very narrow suite of methods, mostly 19th century and early 20th century methods, and often uses them poorly and doesn't know the latest methods from the 21st century.
>>: 76.
>>: 76, very good. Okay. I did do this a year and a half ago. So it might
be that things have improved. So this could be killed. You can just -- so my
point is that when we use the word sharing, we have a huge amount of sharing to
do as a community of experts. I'm not sure this is right, but I sort of think
that I'm sort of an x-ray astronomer in the other half of my research life, and
I don't have to worry that much in x-ray astronomy about sharing the results.
I go and I use a telescope. I do a study. I publish it, it's part of a normal
science progress and other people in my specialty learn about it. It gets
assimilated in some review articles after a few years. And I don't worry that
much about promulgation. Publication is a natural and almost complete
promulgation process. But in statistics and informatics, I think that isn't
true or is less true.
First of all, we tend not to publish that much. One reason why there's a new journal being formed over here is we're not publishing very much in refereed journals. But it's even more than that. When someone develops a method and
someone like Nick just described to us an incredible service, it needs to be
used by other people to obtain and achieve its full meaning.
So we can write a paper on the data mining method, but what we really want is
for hundreds and thousands of people and studies to use the methods.
So it's very hard to do this. I personally succeeded once: in 1984 I wrote a paper, published in 1985, taking a modern field of statistics that astronomy had more or less never heard of and translating it. It was called survival analysis for non-detections. And I wrote a code, and it's been used about a thousand times. And I consider that the biggest success of my career.
And Jeff Scargle has had this success more than once. Certainly the Lomb-Scargle periodogram is up there. I believe it's the highest cited paper in astrostatistics. And you've got a couple of others as well.
So it's not as though it can't be done, but it's done rarely, and I think
there's so much going on in data mining in this room and in methodology that we
really have to emphasize education and promulgation. That's my point.
>>: I'm not sure I can speak as eloquently as Eric, but I was making --
>>: No one can.
>>: I was making notes beforehand, and I think -- I almost wish, you know, you could ask me a technical problem question, as opposed to what, I agree with Eric, is almost a sociological one.
>>: What's the value of [inaudible]?
>>: Oh, 42. Plus or minus. But one of the notes that I wrote down going into
this was industrialization. And what we just heard from Eric was an example of
that, of I go to conferences, workshops like this, and I've gone to previous
ones in which you hear about these great methods, but then actually getting
somebody else to use it or producing the software in a form, making it
available in a way that somebody else can actually use it without having
essentially to reconstruct everything that you did in writing it, I guess
that's partly a technical issue, but really, it's clearly a very tough problem, because we don't have that kind of dissemination of lots of software codes in the community.
And I, again, I wish it was simply a technical issue of, well, if we just had
another website, or if we just, you know, had another something technically,
we'd solve it. It's clearly not. It's along the lines of the sociological,
getting something into journals in a way that connects. And it may be that
part of it is connecting in a way that is doing it with the science questions.
And only when codes are used for a couple of different science questions, you
know, period -- spectral analysis with a Lomb-Scargle applies in a couple of
different areas of subfields. Survival Analysis is something that applies to a
couple different areas and people can then make the leap of oh, yeah, I'm not
interested in galaxies, I'm interested in stars, but I can see how that would
apply to my work. But I wish it was simply a technical problem.
>>: The other is when they're adopted by large teams, the teams are getting
larger and larger. And when there's a concerted effort to use a particular
code because it's advantageous for some reason, people just jump on board that
naturally, because that's what's supported by --
>>: Whoever put your [indiscernible] on the algorithm used by Sloan to measure magnitudes? No. You use a [indiscernible], because that [indiscernible] was from the beginning from the journal of the public. The algorithms were thought for Sloan and they stayed with Sloan. So the large teams actually do exactly the opposite.
>>: I don't think so.
>>: Yeah, but only [indiscernible].
>>: Let's not confuse instrument type with data analysis.
>>: Yeah. I will be careful with that.
>>: But I guess the point is it's the hook. If you can find a hook somewhere
where there are lots of people who are going to follow, because that's the
easiest thing to do, then you can get adoption if your code is useful and easy.
>>: Let me just add a few things to what was said. One thing is I agree with 95 percent of what Eric says. Besides the numbers, I think his analysis was absolutely correct. One thing, which I learned from experience: if you want to obtain good results from data mining methods, you need to finely tune the data mining methods. Therefore, you need [indiscernible] which are not in the background of the regular astronomer.
On the other hand, you cannot [indiscernible] that someone else is solving the problem for you unless you are in a large team, where you do something we were discussing before. There are several reasons for which I think this is one of the notable points. Because if you don't take the optimal model, you don't get optimal results, and therefore instead of doing a good service to data mining, you do a bad service to data mining.
There is an entire community which works in data mining, like there's an entire community which works in statistics. And in the basic case, the root mean square astronomer can interact with four or five methods. And, in fact, this is what you see, for instance, in CANFAR. You have seven. You have just the tip of the iceberg of data mining methods, which are on average [indiscernible] but are not the optimal methods for the problem.
So it's difficult. Sometimes I have solutions; I have more problems than solutions. Maybe it simply is that the real problem is a cultural problem. In the past, we were using statistics. Then we have gone to different [indiscernible] because the number of objects which we are using is much higher.
So we're using statistics in a different way. Data mining is a different form
of statistics where you can define [indiscernible] for very, very large data
sets, basically where you want to obtain another type of information from your
data. It's something which needs to be done.
There is no doubt about this. You can like it, you can dislike it, you can be more or less mathematically prone. But data mining is something which sooner or later needs to be done. And it needs to be done by a generation formed in a different way. It must belong to the genetic background of future astronomy. [indiscernible] produce this genetic mutation.
>>: Beneficial one.
>>: Yeah, of course.
>>: Can I say more or less the same thing, but in a slightly different way. The problem of the lack of uptake of these modern methods by the astronomical community has worried me for a long time, and I think there are two parts to it. There is motivation and there is implementation.
The motivation part boils down to resources and results. And, of course, you have people using them to get results. So there is [indiscernible] there. The implementation, I think, is two things: education and usability. Education is a huge problem. [indiscernible] correctly says we do not yet have a widespread, high quality curriculum to introduce these methods for science in the 21st century, for the simple reason that professors are ignorant of them. But that has to be somehow evolved out of, and we can discuss on Thursday some possible ways to go about it.
But there is also the usability issue. We did look, of course, at a lot of different data mining packages and [indiscernible] virtual observatory. The problem is, if you take a piece of commercial software, say, Microsoft Office, somebody shows you what to do in PowerPoint, and in five minutes you can start doing stuff. Very quickly, you're kind of an expert.
It would sure be nice to have at least an introductory data mining version of that on every astronomer's desktop and laptop. Now, [indiscernible] is right: to do this really well, you need to know the stuff. But I think there is a whole grade from, you know, a simple clustering method, which is displayed and looks right or whatever, to really getting into the guts of the statistics and whatnot.
So that is a really big issue. You look at something like [indiscernible], which is a popular data mining package. A statistics manual comes with it. Nobody wants to study this. You know, people don't read one page, you know, read the [indiscernible]. So usability, packaging things in a good way, is I think a nontrivial issue, an implementation issue. And maybe we need to work with professional software architects or display architects.
>>: Can we just go [indiscernible].
>>: I actually have a very specific answer, a suggestion, which you probably don't know about. There's a CRAN package in R called rattle. Am I right? Rattle? Yeah? So rattle is a GUI for 40 other CRAN packages for data mining. And it essentially makes a PC pull-down menu out of R, which is otherwise a command line situation. And what would you say? Have you tried it? It's what I'd call not bad. It's not as good as Weka. There's RWeka, by the way. But rattle is a mechanism for teaching which no one's taken advantage of yet.
>>: For teaching, it's good. But for implementation, I have my doubts.
>>: It's not made for -- thank you.
>>: So, the perspective from someone who is not an astronomer listening today: I guess the question I have is to what extent are people thinking of these problems as astronomical problems, versus general problems?
I know one of the fields that I work in, [indiscernible] science, suffered for many years because it kind of kept moving away from everything else that was dealing with the same kinds of analogous processes. That is, we run into a problem in these fields of getting somewhat narrow. So we end up sort of talking only to each other.
So what you find in statistics generally is, if you move to the softer side of the sciences, the stats get better, okay. Generally, the sociologists are better statisticians than the physicists, right?
>>: Sorry, I have a problem. I've been working with sociology.
>>: So you know, and the geographers who do spatial stats are very sophisticated. So one of the issues here is how to tap into that market. You know, obviously, R was not written by a bunch of astronomers, right? And things in the MATLAB file exchange are -- in the previous talk, there's a whole field of sort of shape language models that looks at kind of fitting equations to data where you think you know something about the underlying process. So is this a problem that -- are we too [indiscernible]? And I think this is not just a criticism of astronomers.
>>: You know, when you have no choice, that's when a lot of adoption happens. And maybe this tsunami that's been predicted for how many years now is going to impose on people a requirement to use --
>>: Sorry. I was just going to say, from my own background in radio astronomy,
the VLA, which is one of the most productive telescopes ever, operated nothing -- or before it became the JVLA, it operated nothing at all like what it was designed to operate like. And it wasn't until people actually started getting the data that they then said, oh, there are other ways to look at it.
I very much think it was a case of it wasn't until there was a necessity that
people were forced to analyze the data. It's nice to think about what do I do
with a hundred billion record catalog or something. And then when I'm forced
to look at a hundred billion record catalog, I suddenly say oh, okay, I better
figure this out.
>>: In fact, one of the things I liked best in these days was the idea to join forces with the other [indiscernible] sciences. I mean, there's so much in common in the problems. And so much in common in the solutions. And other fields are much ahead of us. For instance, the platform that I'm having problems launching in the astronomical community is used quite a bit by informaticians. And actually not only by informaticians: medical doctors use it for diagnosis, for problems.
Recently, there was a paper about the treatment of diabetic patients which was completely done with [indiscernible]. So I think the problem is not only how to bring [indiscernible] to this community, but also how to build a common forum.
>>: Which is what E-science is all about.
>>: Exactly. But a real implementation. I don't want to waste ten years of my life doing [indiscernible] just for funding reasons if this is already available somewhere. I can -- I'm smart enough to think that I can invent something different to be found. The problem is that there is a complete lack of communication. Because the other communities do not communicate with us. We do not communicate with our own community, because this community now is [indiscernible] to this mechanism, or this huge number of services which just find themselves. The astronomers are only doing service, not anymore. This is my personal opinion. I can be wrong.
But they want easy tools to use, with which they don't have to think, in order to get their results. So it's like the theater, in my opinion. It's something from which we have to get out.
>>: Is it a romance or a tragedy?
>>: We don't know.
>>: So I guess it's worth mentioning that nothing in either CANFAR or Skytree is astronomy specific. So in principle, the only astronomy-specific part of it is that that's who [indiscernible]. So there's nothing stopping anybody else who has data who wants to use the system.
And as far as I'm aware, there are no analogous things to that going on in other fields. So it should be of wider interest.
>>: But there is a problem. Trust my experience. If you think that that thing you showed them on the screen is user friendly and will be [indiscernible], forget it. You have to take this thing, [indiscernible] parameters. As soon as they realize that it is not a click of the mouse, they will not use it. You can bet on this. It's a wonderful thing. It works wonderfully. There will be three colleagues from the community who will make the effort to mount their libraries, to run them. Because the problem is the following. So far, these things have been done in our community by people who are hybrids like me, or computer scientists.
In this community, how many are the real astronomers? Real, for a specific question. That is always a question to ask. Can I have a -- please raise your hand. Real. George? Two.
>>: Wait, what was the question?
>>: I'm an astronomer.
>>: I am an astronomer too, but real -- but without a special bias for computing or for astroinformatics. Like, for instance, 95 percent of the people at Caltech or 95 percent of the people at Harvard, and so that's the problem. It's [indiscernible] people. I'll go and test things tonight from my hotel room, you know, to see how it works, like most other people in this room. But we are not the [typical] astronomer.
>>: Yeah, I mean, one reason is at the level of these [indiscernible] moments,
we could make it a bit easier, but it's designed for the people who are happy
to do the stuff at the command line, not designed for the astronomer's needs.
>>: Which, I do have to say, a lot of grad students would be quite comfortable at the command line and wouldn't trust GUIs and fancy front ends to actually do their real science.
>>: So it's designed for real science.
>>: It's also what I meant by user --
>>: I do agree that there is a lot of common -- it's a general -- data mining is generally something that can be used for several implementations, several different [inaudible]. Usually you use the same tools for [indiscernible]. On the other hand, I think that on top of that, you need some kind of semantic layer in which you make things -- tailor things for a specific audience. So astronomers do not know what features they use, but they know what a [indiscernible] is.
So that's my issue. So it is true that you can have the same tools on [indiscernible], but you need something that is tailored so that you have a common vocabulary.
On the other hand, we are seeing that there are tools emerging, like [indiscernible], or we heard from one of the Microsoft researchers about the [indiscernible] algorithm that will discover knowledge of [indiscernible]. These kinds of algorithms are promising in that they become black boxes. We need to supply only a very few decisions from our domain, for example, which are the features that you want to use as features and the ones that you want to use as [indiscernible]. And then the machinery will do its black magic, and will give you the answer that you will be able to incorporate to discover actual new knowledge.
But that could work for some classes of problems. But, for example, if you have to classify [indiscernible] that's shown before, earlier, that's tricky. That's something that you have to do by knowing what's inside the black box, and you have to [indiscernible] to find the actual algorithms that you need.
So for some classes of problems, I can subscribe to the idea that using data mining tools can be as easy as making a PowerPoint presentation. But for others, maybe for more interesting and more tricky problems, I cannot subscribe. I think you have to know more than that. [indiscernible] amounts of knowledge that is required of an astronomer, you cannot become both an astronomer, a computer scientist, and a software engineer in order to do your science.
But there is some knowledge that [indiscernible] in order to use these kinds of tools. And it cannot be as easy as your PowerPoint.
>>: I was going to ask, is that really true? From the standpoint of -- I keep coming back to this usability. You can, and I have, taught high school students how to use data archives. Because fundamentally, you can ask very simple questions. They don't have to know anything about where the data are stored or how they're stored or what's actually happening. But the basic idea of find data about M31, or find data about this galaxy: you go type in the galaxy name, hit return, and something pops back. Then you can start asking deeper questions.
And you can do the same thing with ADS. You can ask very deep questions with
ADS. At the same time if you want to find what papers have I published in the
last five years, that's easy. I sort of wonder if there isn't -- I very much
like Michael Kurtz's -- sure, there's always going to be these fingers going
up, but I think part of this discussion should be how do we move the overall
boundary forward so that at least very simple stuff is still being tackled by
more modern methods.
>>: Just a quick comment. I don't understand why so often we tend to be polarizing, either black or white. I think what [indiscernible] said is absolutely correct. I've been working for many years on the luminosity function, doing the standard thing. I really got into trouble because I do this [indiscernible] of counts and backgrounds. I called Eric. Eric still a member and [indiscernible] detailed the things.
The problem is that up to some point, you have standard work. You want to use standard things, which you know, which you learned in your curriculum.
>>: This is what Madison's published?
>>: Exactly. And also at some point, you realize, because no one [indiscernible], that there is a problem. So you say, here I need the expert. So what [indiscernible] is saying is absolutely right. You have problems, but you need to have a common background in the community. Because otherwise, the community will never even have the perception that that problem with which they're fighting can be solved with a data mining approach.
You understand what I mean. Because once you have that -- then there is the problem of usability and so on. So I always track it back to the original, because George said now we have [indiscernible] data miner, Weka, [indiscernible]. Now we have -- the problem is not to find the tool. MATLAB is a fantastic data mining tool. It's wonderful. It's not a question of where to find something to do data mining. The problem is to begin to form a community of people who [indiscernible] data mining every time that they recognize, in the problem they are confronted with, that it is a data mining problem.
And this is the main problem, in my opinion. Because we know that we need data mining. At least this is what has been repeated year after year for many years, because the data are too large or too complex, too articulated, too big to process with the traditional techniques. But in the meanwhile, not one step to teach, not one step to teach.
>>: I was going to say, we should --
>>: Sorry. I tend to be --
>>: I would like to ask you a question. Which is the single most serious point of failure in the uptake of data mining methodology as a whole, in your opinion? I have my opinion, but I'm biased, so I want to see --
>>: [indiscernible].
>>: I would like your answer to this question. Is it the usability of the tools that are out there, or the number and wealth of examples of successful [indiscernible] of this kind of framework applied to astronomy? Or a simple lack of culture, okay -- we're talking about people that actually don't know how to use things because they don't have the cultural basis to understand what they should do. So I'm asking you, of those three.
>>: I think it's the lack of culture, which is what Pepe just said. The lack
of knowledge, and the lack of culture, because the tools are there. Even in
neural nets, there's a history, 15, 20 years now in neural nets. It's not as
though they don't see it.
>>: What about now? What about the next ten years?
>>: I think there's going to be a tipping point. Sloan was there, and Sloan did not generally use great methods -- I'm not talking about the pipeline, I'm talking about sort of the scientists using Sloan. They used traditional methods and wrote fabulous science papers, 2,000 of them. Because of Sloan being so cheap and so successful, every agency in the world is now buying new survey telescopes.
NSF is buying LSST, okay. It's just amazing. And so everyone's going to be
forced into that direction. And there's young people who are being trained and
young people who just are more computer knowledgeable than when we were young,
and I think there's going to be a tipping point.
So I think we have to be tipping point agents, change agents in the community.
>>: Beyond the cliff.
>>: We have to be ready for them. We have to prepare these things. We have to have websites and cookbooks and examples and other things ready. And when the tipping occurs, we just have to speed it up.
>>: Get out of the way.
>>: Do like jujitsu. We have to take the momentum and direct it. So five years from now, I think it will be better than today.
>>: Jujitsu is a better place to be.
>>: Last comment. Since there are very few people, since we really need to go [indiscernible] to the community, why don't we try to find a way to meet? At least try to.
>>: So let's thank our panel.