>> Ray Norris: -- completion in a couple of years. It consists of 36 12-meter
antennas, which is actually not that big as far as radio telescopes on the
whole are concerned. The really innovative thing about it is this 192-pixel
phased array feed. This is a Schmidt telescope, if you like, in radio
astronomy. This makes an enormous difference to the survey speed, and this is
a survey telescope. It gives it a 30 square degree field of view. So it's
even better than the Schmidt.
You can regard this as a step change: this is brand new technology, and no
other telescope has yet been equipped with this. It's high risk technology,
but it looks like it's working okay. This is a bit like optical astronomers'
migration from single channel photometers to CCDs. And we've never looked
back. And I think the same is going to be true of radio astronomy. In 20
years' time, my guess is every radio telescope is going to have a phased array
feed or [indiscernible].
So the antennas are built in China, [indiscernible] delivered on the site, and
we're extremely pleased with them. The surface RMS is twice what we -- sorry,
half what we specified, twice as good as what we specified, and they've really
worked out very well indeed. We're very happy with them.
That shows them a month or so back being assembled in the desert. They're
assembled in the factory in China, taken to pieces, shipped out to Australia,
and reassembled in the desert. Those of you who know radio telescopes know
that the next step is that you then adjust the panels using holography. We
started to do that. We found that no adjustment was required. And, in fact,
on the 36 antennas, we haven't had to adjust a single panel. They're all just
as they were set in the factory. Very impressive. Good bit of engineering.
So the control building there houses the correlator, which, as I mentioned
yesterday, absorbs ten megawatts of power out where there's no grid, so the
state government, fortunately, is building a small power station to run that.
It's about the same as a small town.
These are the phased array feeds that we keep on going on about, and you can
see there, that's the front of it. That's the working surface. And you can
see the sort of checkerboard. And, if you like, every two corners between the
black squares is full of [indiscernible]. They're very closely coupled. And
then we have 182 receivers behind that.
>>: Every receiver is going to be correlated with every receiver and every
telescope --
>> Ray Norris: No. It's a bit more -- so what actually happens, they get beam
formed into 30 beams, and then each beam is then correlated with every beam in
every telescope.
Okay. So that is a few weeks ago. All 36 antennas completed. Unfortunately,
we haven't managed yet to take a photo including all 36 antennas. Someone's
got to go up in a 747 or something to do that.
A big milestone a couple of weeks ago when we got the closure phase. That's
when you correlate the signals from three antennas using the phased array
feeds and basically hope that thing's zero, which it is. And three days
later, we got our first image. I hasten to say this is with only three
antennas. It's not the world's greatest image, but it shows us all the data
paths are working. I actually thought it was incredibly impressive three days
after getting closure phase. I can say that because I'm not part of the
engineering team. I'm very impressed by what the engineers have done.
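For concreteness: closure phase is the sum of the visibility phases around a
triangle of antennas, and antenna-based gain errors cancel around the loop, so
a point source should give zero, which is what the test hopes for. A minimal
numerical sketch in plain NumPy (not ASKAP's actual pipeline):

```python
import numpy as np

# Closure phase for antennas (1,2,3): phase(V12) + phase(V23) + phase(V31).
# Per-antenna gain phases cancel around the triangle, so for a point source
# the closure phase should come out as zero.
rng = np.random.default_rng(42)

# True point-source visibilities have zero phase; corrupt each antenna with
# an arbitrary unknown gain phase, as real electronics do.
gain = np.exp(1j * rng.uniform(0, 2 * np.pi, size=3))  # per-antenna gains
v12 = gain[0] * np.conj(gain[1])                       # corrupted V_12
v23 = gain[1] * np.conj(gain[2])                       # corrupted V_23
v31 = gain[2] * np.conj(gain[0])                       # corrupted V_31

closure = np.angle(v12 * v23 * v31)                    # radians
print(f"closure phase = {closure:.2e} rad")            # ~0 for a point source
```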
Okay. That's all I'm going to say about ASKAP. ASKAP science: ASKAP is a
survey telescope, and so there was an open request for proposals. There were
38 proposals and ten were selected. So EMU is what I'm going to mainly talk
about. WALLABY is a corresponding HI survey, and then there are eight others,
all very important. Many people here were involved in the VAST experiment,
which is transients, and probably some people involved in some of the others.
I'm going to talk about EMU up there.
So EMU stands for Evolutionary Map of the Universe. It's an all sky continuum
survey. Those are all the details. Probably the four key points: many of you
will know the NVSS, which is the largest all sky radio survey at the moment.
So you can compare this to NVSS. And it goes 45 times deeper than NVSS. This
isn't, of course, a criticism of NVSS. I always say NVSS has been a fantastic
workhorse for almost 20 years, and technology moves on.
As a result, we will have a database of 70 million galaxies, which we'll
detect. All processing has to be done in real time. I'll show you the data
flow in a second. And our plan is to put all the data in the public domain.
So the idea is, once we've finished all the commissioning and debugging, there
will be a -- you observe a part of [indiscernible] of ours. You
[indiscernible] until the processing is finished. We have a data quality
control step and hopefully, 24 hours later, you'll find the sources. That's
our goal. So there's no proprietary period.
And unusually for radio survey projects, we're hoping, planning to do cross-IDs
and redshifts as well as part of the survey. I'm hoping somebody else can do
it.
So to show you where we fit in the grand scheme, this shows all the current
big radio surveys. So we've got sensitivity going along here, so left is
good. And that's the area of sky, and up is good. So up here is good. Down
there is bad. And so you've got NVSS up there, which I mentioned, which is
the biggest sky survey, which we all know and love, which we all use every
day, which is the largest but not very deep.
The deepest one until very recently was the [indiscernible] observation, which
goes very deep with the VLA, but it's just a single pointing. And on astro-ph
is a new paper by Jim Condon, which is the same field with the [indiscernible]
VLA going a factor of a few deeper. So it's now currently the deepest radio
survey.
And you'll notice that all of these surveys are bounded by this line here.
ATLAS I'll mention. That's the survey we've been doing on the Australia
Telescope, which I'll mention in a minute. That's important because it's
similar in some ways to EMU. So basically, that line corresponds to a few
months' observing time on any modern radio telescope. You can't get much to
the left of that with existing radio telescopes, and that's where EMU sits.
So it really is out there in the white space.
And for those of you who like to think about discoveries per square centimeter
of parameter space on a diagram, you can count up the number of discoveries
there, and you see why I get excited about EMU.
To go from NVSS to EMU, okay, so a factor of 45. So if you want to do -- a
referee on a paper recently said this isn't a fundamental limit, because you
just need more observing time on the VLA. Yes, you do, it's 600 years with the
VLA to do EMU. So it really is a fundamental limit, actually.
>>: [inaudible].
>> Ray Norris: Yeah. Come on.
>>: Take your longevity pills.
>>: Fundamentally because you have the multiple beams.
>> Ray Norris: That's right. Basically, these phased array feeds give a
factor of 30 increase in survey speed.
>>: Do you find the galaxy first in continuum and then find the redshift on
the [indiscernible] line, or vice versa?
>> Ray Norris: Can I talk about that in a minute? Or, no, I'll answer it now.
So yes, we're going to produce complete [indiscernible] sources. We then do
cross-IDs against optical and [indiscernible] sources. WALLABY, the HI
survey, will produce redshifts for about one percent of our sources, the
nearest one percent, basically, and we rely on optical for the remainder.
Right. So here's the data challenge. I mean, these are [indiscernible] big
numbers, but we [indiscernible] so many big numbers, probably nobody is going
to be at all impressed. So we get nine terabytes per second out of the
antennas. We have this ten megawatt correlator out there, which reduces it
down to ten gigabits per second. We have a supercomputer in Perth where we
basically do the heavy duty data reduction.
So the net result is our UV data comes to 70 petabytes a year. But we can't
actually afford to store that. And so we're going to store four petabytes a
year: all the images, but we throw away all the UV dat -- all the spectral
line UV data. We'd like to keep it, but we can't afford it.
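For scale, here is the back-of-envelope conversion from those sustained rates
to yearly volumes, assuming round-the-clock observing; the 70 petabyte figure
quoted in the talk is the same order of magnitude as this toy estimate:

```python
# Back-of-envelope check on the data-flow numbers above, assuming
# round-the-clock observing (~3.15e7 seconds per year).
SECONDS_PER_YEAR = 3.15e7

antenna_rate_tb_s = 9            # terabytes/s out of the antennas
uv_rate_gb_s = 10 / 8            # 10 gigabits/s after the correlator

raw_pb_yr = antenna_rate_tb_s * SECONDS_PER_YEAR / 1e3   # TB -> PB
uv_pb_yr = uv_rate_gb_s * SECONDS_PER_YEAR / 1e6         # GB -> PB

print(f"raw antenna stream : ~{raw_pb_yr:,.0f} PB/yr")   # ~283,500 PB/yr
print(f"post-correlator UV : ~{uv_pb_yr:.0f} PB/yr")     # ~39 PB/yr
```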
>>: [inaudible].
>> Ray Norris: It's a few million for four. So it's a bit significant.
Actually, do you know the number? Okay. So for EMU, we just want to store
the images and the catalogs. So our images are 100 terabytes, which is pretty
reasonable. And when we extract the catalogs, we have about 30 gigabytes of
tables, which again is quite reasonable. And by the time we add the optical
data, my guess is 50. That's a bit low. Anyway, it's a manageable amount.
We know what the sky looks like at that depth, because we actually have
observed small areas of the sky. This is actually the ATLAS survey, which I
mentioned before. And it's important because it's to the same sensitivity and
resolution as the EMU survey. And so we're actually using this as a training
set for many of the data things I'm about to talk about.
So if you look deep down in there, every object you see there is a galaxy.
About half of them are star forming galaxies, about half are AGN. So it's
different from NVSS, where virtually everything is an AGN. Once you get down
to this depth, the sky really changes. Most things you're looking at are
actually [indiscernible] star forming galaxies. But here, we have a nice
head-tail, a relic and a cluster and so on.
Okay. The science goals. Well, basically, the science goal of EMU is to look
at the formation and evolution of galaxies. That's the big thing that's
driving it. To go into a bit more detail, we can break that down into a
couple of science goals. But it turns out we can also do some really good
cosmology with it, which is something we didn't realize at the beginning.
Clusters, we're also going to do the galactic plane. Basically, it's a
by-product. And, of course, legacy value. But what I'm really going to focus
on in a minute is this explore an uncharted region of observational parameter
space, because that's actually quite an interesting problem, how you do that.
Technical challenges, obviously lots of things with image processing, dynamic
range and so forth which I'm not going to talk very much about. I'm going to
talk very briefly about these three here.
Okay. So the source extraction. The way that EMU works: the EMU team is 230
people all over the world contributing varying amounts of time, as you can
imagine, and the real work is done by the working groups.
So we have a source extraction working group run by Andrew Hopkins. And the
interesting thing, they did a face-off between all the existing source
extractors, things like SExtractor, Sfind, the things you know and love in
[indiscernible] and so on. We found that actually none of them work that
well. All of them miss sources, they introduce spurious sources,
particularly if you're going to run them in an automated way. None are
actually going to do the job.
So we've been developing source extraction tools, and one of the nice things
about the survey, something that I didn't expect, is that we keep on
generating journal papers, even before the survey starts, which is not what I
expected at all.
But when you look at the processes you need to do to make a survey like this
work, the tools aren't out there and the processes aren't out there, so you
end up doing new stuff, which is nice.
We have a cross identification team led by Loretta Dunne in New Zealand.
We're going to automate the cross-IDs. This won't be a simple nearest
neighbor thing; she's exploring algorithms at the moment and we'll probably
end up with a Bayesian thing. [indiscernible] is involved in the Bayesian bit
of that.
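As a baseline for comparison, this is what the simple nearest-neighbor
matching looks like, sketched here with astropy; the coordinates and the 3
arcsecond radius are illustrative placeholders, not EMU values:

```python
from astropy import units as u
from astropy.coordinates import SkyCoord

# The simple nearest-neighbour baseline the working group is going beyond:
# match each radio source to its closest optical/IR counterpart and accept
# matches inside an assumed association radius.
radio = SkyCoord(ra=[150.1000, 150.4200] * u.deg,
                 dec=[2.2100, 2.3500] * u.deg)
optical = SkyCoord(ra=[150.0995, 150.7000, 150.4203] * u.deg,
                   dec=[2.2102, 2.5000, 2.3499] * u.deg)

idx, sep2d, _ = radio.match_to_catalog_sky(optical)  # nearest neighbour only
accepted = sep2d < 3 * u.arcsec                      # assumed match radius

for i, (j, s, ok) in enumerate(zip(idx, sep2d.arcsec, accepted)):
    print(f"radio {i} -> optical {j}: {s:.2f} arcsec, accepted={ok}")
```

A Bayesian scheme replaces the hard radius cut with a likelihood ratio built
from positional errors and source densities, which is the direction described
above.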
And so with the available surveys, that's WISE, Sloan, SkyMapper, VHS,
mainly, we expect to be able to cross-ID about 80 percent of objects. The
remaining 20 percent of the images are just too faint, and these are mainly
high redshift AGNs, which you just can't see in the optical [indiscernible].
And there's [indiscernible] which are interesting but complicated. And
they're hard.
And so for those ones, we're working with the Galaxy Zoo people, who I don't
need to introduce in this audience, and we're developing with them a thing
called Radio Zoo. So this is the interesting thing. We've got here, there's
a bog-standard spiral there, and other galaxies are interacting with it. To
the eye, it's pretty bloody obvious what's going on there, and you've also got
some [indiscernible] regions and things in there. But for an algorithm, it's
really hard. So we're developing this thing called Radio Zoo, and we hope to
have a beta version out in a couple of months.
Redshifts, we've got a working group run by Nick Seymour. He [indiscernible]
did the first redshifts for COSMOS. And I think Mara was very graceful about
it, but I think she was pretty upset when she found that her nice template
models, which had years of experience built into them, actually performed
worse than a kNN, which was knocked up by some bright student, Peter Zinn.
So anyway, they're doing a [indiscernible] challenge there, looking at
different classes of objects to see which algorithms work best.
The chances are we will end up, as everybody else does, using an ensemble of
these.
Okay. Paradigm shift. For redshifts, you don't always actually want to know
the redshift of every galaxy. I'm talking redshifts here, but the same goes
for a lot of things in a big survey. What you actually want is to do some
test. Let's say in cosmology, you want to go through a group of galaxies and
you want to know the redshift distribution of those galaxies. This is
actually quite different from asking what the redshift of an individual
galaxy is.
So if I take all of our radio sources and I ignore the optical and infrared
data -- we're also measuring polarization -- I know that if I just pick out
that subset of [indiscernible] which are polarized in the radio, they'll have
a median redshift of 1.9.
If I take the subset which is unpolarized in the radio, they'll have a median
redshift of 1.1. So I can immediately divide our sample into two sub-groups.
So if I want to look at, let's say, the [indiscernible] effect, I can do a
cross-correlation between the CMB and our radio sources and measure what dark
energy was doing at 1.9 and 1.1. There ought to be bugger all if this
model's right. So in that sense it's a bad example.
But you see the principle. You don't actually need to know about every
galaxy. You use simple diagnostics like polarization or spectral index or the
radio K-z relation, or even a non-detection is giving you information. And so
I call this approach statistical redshifts, and we're also exploring this:
can we work out the redshift distribution and other properties of the
population as a whole?
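A toy sketch of the idea, with the talk's two median redshifts baked in; the
lognormal shapes and the polarized fraction are assumptions made purely for
illustration:

```python
import numpy as np

# Toy illustration of "statistical redshifts". The medians (1.9 polarized,
# 1.1 unpolarized) are the values quoted in the talk; the lognormal shapes
# and the 30% polarized fraction are assumptions for this sketch only.
rng = np.random.default_rng(0)
n = 100_000

def class_redshifts(median, size):
    """Draw redshifts from an assumed lognormal with the given median."""
    return rng.lognormal(mean=np.log(median), sigma=0.5, size=size)

polarized = rng.random(n) < 0.3
z = np.where(polarized, class_redshifts(1.9, n), class_redshifts(1.1, n))

# No per-object redshift was ever measured, but population-level tests
# (e.g. cross-correlating each subset with the CMB) can use these z's.
print(f"median z, polarized  : {np.median(z[polarized]):.2f}")   # ~1.9
print(f"median z, unpolarized: {np.median(z[~polarized]):.2f}")  # ~1.1
```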
Okay. Now we get to it. I've almost run out of time. Mining radio survey
data for the unexpected. Right. We're going to find things in our data that
we didn't expect to see there, we hope. Oh, by the way, if you didn't know
what WTF stands for, it's Widefield ouTlier Finder. This is the name of the
project and it's to mine the data for the unexpected.
So astronomy does not work by testing a hypothesis. You know, the old thing
we tell graduate students, what Popper said: you have a hypothesis, you
[indiscernible] test the hypothesis. That's fine. Works really well in
geophysics. Most of astronomy is not done like that. So look at the HR
diagram. Hertzsprung and Russell went out, they decided to correlate these
two quantities. They found the main sequence, and from that we found out
about [indiscernible] evolution.
And if you look at the Nobel Prizes in physics over the last few years, on
this plot, the black ones are the ones where people have just found something
unexpected in their data. The white ones are where people are testing a
hypothesis, à la Karl Popper. And you can see we actually have gotten 11
Nobel Prizes from stumbling across stuff, compared to 7 where people have been
testing hypotheses.
So in astronomy, most discoveries are made by going to large areas of
unexplored parameter space and seeing what's there. It's a voyage of
exploration, not testing a hypothesis.
Okay. So let's take an example. Jocelyn Bell, discovery of pulsars. What
happened? Jocelyn Bell, as part of a Ph.D., is looking at a new
[indiscernible] space. She's looking at high time resolution data and she's
looking for scintillation.
And she found these bits of interference, and she's told: focus on your
thesis, don't worry about that interference, it's probably a [indiscernible]
or something. But she did. She was bloody-minded, she was persistent,
obstinate, and she kept on following up these bits of scruff, and then she
found, yes, they are occurring at the same sidereal time every day. And so
she discovers pulsars and her supervisor got the Nobel Prize for it. I won't
comment on that.
Anyway, look at the factors that went into that discovery. Firstly, she was
exploring a new area of observational phase space. She knew her instrument
really well. She could tell the difference between rubbish on the detector
and something that was astronomical.
She's observant enough to look at the other things. She's open-minded; she's
prepared for discovery. That's important. Some people aren't. She's in a
supportive environment, where people are expected to make discoveries. And
she's bloody-minded and persistent.
Okay. So let's look at what we're going to do in EMU. Okay. We've got the
stacked parameter space. No problem there. Are the discoveries uniformly
distributed across the diagram? Well, Occam's Razor says, well, in principle,
they probably are. You've got no reason to say they aren't. So there should
be lots of good discoveries up there.
But what about the difficulty of finding them? Are they equally easy to find?
Well, no. Because up here, we get into large volumes of data and that's where
you get the problem.
So firstly, nobody's really going to be sufficiently familiar with the
instrument. We're at [indiscernible] arm's length; we're using these very
sophisticated software tools to analyze our data. So we will answer the
questions we're asking really well. You pose the question, we'll design the
software to answer it.
What we won't find are the things we're not looking for, the unknown unknowns.
So the question is, can we mine this data by looking for the unexpected? Let's
skip that slide. Well, one thing. If we don't try to do this, we're not
giving the maximum bang for our buck, and people like the NSF won't like us.
Well, we're in Australia so that's all right. ARC in Australia.
>>: We have a specific instruction against using things like WTF.
>> Ray Norris: You mean the acronym?
>>: The implication.
>> Ray Norris: Right. Okay. So we're actually planning a project called WTF,
where we are going to systematically mine the EMU database, discarding objects
that we already know about. Most of the things we find, of course, will be
artifacts. That's okay. They're good for quality control. We'll have the
[indiscernible] results, of course. And then we'll have a few new, genuinely
new classes of objects.
So we're in the process of building up, figuring out how we're going to do
it -- various approaches, decision trees. We've heard all of these today.
KFN is the opposite of KNN. And we'll probably end up using an ensemble
approach.
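One minimal sketch of the nearest-neighbor flavor of this mining, with
scikit-learn standing in for whatever the WTF pipeline actually ends up
using: score each object by the distance to its k-th nearest neighbor in
feature space, so the most isolated objects become candidates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Known classes live in dense regions of feature space (flux ratios,
# spectral index, polarization, ...); candidate "unknown unknowns" are
# the most isolated points. All data here are toys.
rng = np.random.default_rng(1)

known = rng.normal(0.0, 1.0, size=(5000, 4))   # toy "known population"
weird = rng.uniform(-8, 8, size=(10, 4))       # a few genuine oddballs
features = np.vstack([known, weird])

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(features)  # +1 skips self-match
dist, _ = nn.kneighbors(features)
score = dist[:, -1]                            # k-th neighbour distance

candidates = np.argsort(score)[-10:]           # most isolated sources
print("outlier candidates (indices):", sorted(candidates.tolist()))
```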
So EMU is an open project. Anybody here is welcome to join, and we'd really
appreciate help. If you've got ideas on how to do this, we'd love to have you
join the project. And we hope that these approaches will be useful for other
surveys. So later this year, I'm hoping to put out a data challenge if
anybody wants to have a go and see what we can dig out of the ATLAS data.
And I'll finish there.
>> Yan Xu: Thank you. Questions?
>>: There is a very slight conflict between the WTF program and the throwing
away of the UV data.
>> Ray Norris: Yes.
>>: Right?
>> Ray Norris: There is.
>>: So I mean, inasmuch as -- I just hate to give somebody running a project
like this, which is so awesome, any kind of advice. But if you could possibly
hold your UV data for as long as possible, more statistical samples of it or
[indiscernible] of it or something, it's really going to be valuable in your
following up.
>> Ray Norris: No, we'd really like to. There's the problem that
[indiscernible] 12 hours a day [indiscernible] takes about 12 hours, and it
needs a supercomputer. You're not going to download stuff onto your Mac and
play with it. Nevertheless, your point is right, yeah. Ideally you would.
It's just finances.
>>: You don't need the same time resolution, though. You can [indiscernible]
in slices.
>> Ray Norris: How do you know? How do you know you don't need --
>>: Compare your transient phenomenon to what was there before.
>> Ray Norris: Yeah. For VAST, that's true. If we're in a funny part of
parameter space, yeah. I'm very wary of making -- I mean, I sort of agree with
you, but I'm wary of making these assumptions. Quick question?
>>: Yes, just to conserve [indiscernible] and they have petabytes of hundred
thousand or something, but [indiscernible] petabytes.
>> Ray Norris: I think it's different between buying disks and having them in
a data center with RAID arrays and mirrors and the rest of it.
>>: It just adds up.
>> Ray Norris: Yeah, and the speed of access, of course.
>>: And then you've got to power it for the next --
>> Ray Norris: Oh, yeah, power.
>>: Okay. Let's move on to our next speaker. Chenzhou Cui.
>> Chenzhou Cui: So in my talk, I will give a brief overview of our work
during the last several years. Compared to the [indiscernible] and the
research here, activities in China are still in a very early stage. So I hope
you can understand our progress.
First, this is our understanding of the virtual observatory. The virtual
observatory is a data intensive online astronomical research and education
environment, taking advantage of advanced information technologies to achieve
seamless, global access to astronomical information.
So the China VO project is the national VO project in China. It was initiated
in 2002, just ten years ago, by the Chinese astronomical community with the
recommendation of Jim Gray, who became a member of the [indiscernible].
We mainly focus our efforts on the following fields. The first one is
construction of a China VO platform to provide unified access to online
astronomical resources and services. And we hope to collaborate with national
and international partners to make these products VO enabled. And we hope to
collaborate with astronomers, especially young astronomers, to use VO tools
and services to show the power of the virtual observatory.
And, of course, we will do our best to use VO resources to do public
education and outreach. We are very proud that during the last ten years, we
organized a nationwide VO workshop each year. During the last ten years, ten
workshops have been organized. You can see the attendance numbers. And since
last year, we limited the size of the workshop, but the workshop is so large,
it is hard to control, so it is [indiscernible] topics.
China VO is an active member of the IVOA. During the last several years, we
[indiscernible]. First, the IVOA small projects meeting in 2003. And in
2007, we hosted the spring Interoperability meeting of the IVOA.
After about ten years, the members of the China VO team come from nationwide
universities, observatories and information technology institutes. For
example, NAOC, Central China Normal University, Tianjin University and
Kunming University of Science and Technology are part of the China VO team,
and we collaborate with many international partners, including Johns Hopkins
University, Microsoft Research, Caltech, [indiscernible], ICRAR,
[indiscernible] and other partners.
As we said before, maybe yesterday, the basic goal of the virtual observatory
is to provide seamless, global access to data resources. So data access
services are also a basic task for the China VO. So [indiscernible] is the
Chinese astronomical data center; we hope to connect the nationwide astronomy
[indiscernible] in China and provide uniform access to astronomers.
Currently, we're hosting the following data sets, including several
[indiscernible] from Chinese telescopes. For example, the LAMOST pilot survey
data, LAMOST commissioning data, and, collaborating with Steward Observatory,
the South Galactic Cap U-band Sky Survey, and CSTAR, a small telescope
[indiscernible] operating at the Antarctic observatory. And in the coming two
or three years, there will be more data sets coming from Chinese telescopes.
Additionally, we mirror popular [indiscernible] from international
partnerships, including the CDS VizieR database, the Sloan SkyServer, and
2MASS. Just yesterday, I got the disk from [indiscernible], and very soon we
will set up a mirror site in Beijing for Chandra.
For the LAMOST pilot sky survey: LAMOST is a [indiscernible] spectroscopic
sky survey telescope, similar to the Sloan Digital Sky Survey. From last year
October to the spring of this year, this telescope observed about 300 plates
and will get about half a million spectra, most of them [indiscernible]
spectra. For this sky survey, we provided different data access interfaces,
including the web form, VO interfaces and command line interfaces.
You can [indiscernible] the observations and you can search just like using
the VizieR system, and there is a [indiscernible] of the spectrum. The output
can be displayed using [indiscernible] or TOPCAT. And it seems quite a part
of the [indiscernible] are selected from the Sloan Digital Sky Survey, so
[indiscernible] the Sloan data will be displayed. And you can submit your
query, an SQL query, to the database and get results.
And during the last several years, we developed several small tools. This is
a small converter for OpenOffice or open [indiscernible] to display VOTable
files. And there is a screen capture tool: if you're reading an article, you
can select some words, and then the keyword will be sent back to the
[indiscernible] VO database and you can get a result for the object.
And there is a small tool for database administrators to archive a lot of
[indiscernible] into a database. And based on the [indiscernible], we
developed a unified data access service for the virtual observatory. And
there is another one, a file manager for [indiscernible] files.
Based on the VOTable [indiscernible], we developed a plug-in for MATLAB, so
MATLAB can be used on VOTable [indiscernible]. We can use MATLAB as a
workbench for data mining on the [indiscernible]. Using the platform, we got
the first scientific paper from the China VO. We discovered a candidate Milky
Way satellite from the Sloan data.
And there is work from a partner, Tsinghua University, to use traditional VNC
technology, like a remote desktop, to integrate frequently used high energy
[indiscernible] packages on their server along with the popular data sets,
and then provide users access to the server to [indiscernible] work, just
like the concept of a cloud.
But this work was done about ten years ago. And there's a very popular --
important work in China, the e-VLBI network infrastructure. Currently, this
VLBI network is composed of four radio dishes: in Beijing, 50 meters; in
Kunming, 40 meters; and in Shanghai and Urumqi, 25 meters each. This is the
topology of the e-VLBI network in China. Four stations linked by the Chinese
science and technology network.
This is a multiple purpose, e-science oriented network used by Chinese
astronomers all at the same time. This network also serves deep space
missions for the Chinese government. For example, the Chinese lunar
exploration will use the network for orbit tracking.
And by the end of the year, there will be a new component: a 65 meter
telescope will join the network. And four years later, with the completion of
FAST, the 500 meter telescope, that will join this network too.
Chinese astronomers have done thorough GPU-based high performance simulation
work. This is an example from NAOC. At NAOC, we established a GPU cluster.
The name, Laohu, is Chinese for tiger. This cluster is used by students and
staff from NAOC and other Chinese and international collaborators.
The PI of the cluster is Professor Rainer Spurzem from Heidelberg University.
And the number of users of the cluster has been [indiscernible] 100 currently.
Various topics are completed on this cluster.
This is a hardware system for the cluster, and this is the performance of this
one. The total speed is about 157 teraflops, and we have 85 nodes, each one
with two NVIDIA Tesla GPU cards.
And the BOOTES project is a worldwide network of robotic telescopes, led by
the Institute of Astrophysics of Andalusia [indiscernible] in Spain. We
became a partner of this project last year, and we hosted the first telescope
of this project. Last November, we began to build the observatory. And just
before Christmas, we completed the hardware. And after the Chinese New Year,
we got the first [indiscernible], and in March, we organized an opening
ceremony.
So this observatory is a 60 centimeter small telescope, but with the hardware
and the software system, we can get rich information from this observatory.
You can see the outside view of the dome, and the inside view of the
telescope. These images are from our sky camera, one picture per minute. And
there are a lot of images from the full sky camera.
And you can observe the [indiscernible] in real time. And you can see the
observation log. All the information for the observatory can be got from the
console of the control software. So in China, we have strong requirements
for robotic telescopes, and astronomers show strong interest in time domain
astronomy.
You see, we have projects to build an Antarctic observatory and a Tibet
observatory. And we have a station in Argentina, and lunar-based astronomy,
and we have some international projects, for example the SONG telescope
network and SVOM with France. And for education and amateur observation,
robotic observatories are very useful.
So based on this experience and interest, we initiated our idea for a Chinese
Robotic Autonomous Observatory Network (CRAON) project. This is not to build
a specific telescope, but we hope to provide technical support and a solution
for Chinese users, and we hope to help astronomers and teachers to build
their robotic telescopes.
For the BOOTES-4 telescope, we hope to do some further development. For
example, to provide a VO-event triggered function for the observatory. And,
collaborating with our partners here, to develop a fully automatic
photometric pipeline and fully automatic archiving system and VO-compliant
data access, and maybe even, though it's very hard, automated event
classification.
In the next part, I hope to give you some more slides on our education and
public outreach activities during the last several years. This is a
broadcast of the total solar eclipse in the International Year of Astronomy.
We organized a large scale broadcast of the total solar eclipse. Along the
solar eclipse belt, we deployed 11 observation stations. And using satellite
and next generation network and IP-based networks, we provided signals to our
clients.
Finally, we got very good results. For example, CNN, ABC, AP, and this is
from Poland, yeah. And many Chinese TV stations used our live broadcast
streams. So our live broadcast was advertised at the front page of the IYA
2009. These are the results for this live broadcast. During the last
[indiscernible], we established a close collaboration with Microsoft Research
for the WorldWide Telescope.
In 2010, we organized a nationwide guided tour design contest, and about 200
tours were collected. The director of Microsoft Research Asia and the
president of the Chinese Astronomical Society attended the awards ceremony.
And during the last [indiscernible], we gave lectures and classes at
different places and different universities.
Two weeks ago, at the General Assembly of the IAU in Beijing, there was a
WorldWide Telescope booth, and many attendees visited the booth, and we got
favorable feedback from the visitors.
Limited by manpower and resources, it is impossible to teach students one by
one. So we train teachers and ask the teachers to teach their students, to
[indiscernible] our impact. So we organized several nationwide and regional
trainings for the WorldWide Telescope. And our last work before this workshop
is that we provided the WiFi service for the IAU General Assembly, because
the system of the commercial center could not provide so many simultaneous
WiFi connections. So, collaborating with one company, our team set up the
WiFi service. You see the log. This is the connection number and this is
bandwidth. You see the largest number of connections occurred on the first
day; the number is around 190. And you can see the peak bandwidth occurred on
the Wednesday of the second week.
So the scale is not so large, but this was our first time to test our team
setting up a system of this scale.
>>: Comment? I think the WiFi at Beijing is the best we've had at any General
Assembly yet.
>> Chenzhou Cui: Thank you very much. I'll stop my talk here. I'm so glad to
work with my colleagues, [indiscernible], three beautiful ladies. Thank you.
>>: So while the next speaker sets up, do we have one or two questions?
>>: I went and checked out astro box on the web, a MATLAB thing. It looks
like it's a commercial product that you have to pay for. Is that correct?
>> Chenzhou Cui: No, no, Open Source. We didn't provide the commercial
products.
>>: That's how good it is.
>> Chenzhou Cui: Maybe you visited a different website.
>>: Maybe so. I couldn't find a free download.
>>: So could I have ten seconds on the lunar observatory that I saw? Did I
see a lunar observatory?
>> Chenzhou Cui: Solar eclipse.
>>: Astronomy from the moon you listed.
>> Chenzhou Cui: Oh, lunar?
>>: Yeah, you had a slide on all the different telescopes. The robotic
telescopes.
>>: That's next year.
>>: Okay. I would support such a program.
>> Chenzhou Cui: Thank you.
Building all these telescopes, why don't you just buy some from the NSF.
>> Chenzhou Cui: Maybe next year, there will be a small telescope launched to
the moon from the Chinese --
>>: Oh, that's what you're going to do, is land a small telescope on the moon?
>> Chenzhou Cui: Very small, about 15 centimeters.
>>: What if it lands upside down?
>> Chenzhou Cui: I don't know. Well, next year, there will be the Chang'e 3
satellite launched. There will be a very small telescope, just for a test.
>>: We probably shouldn't run -- let's thank him. Our next talk is by Michael
Kurtz.
>> Michael Kurtz: I'm going to give sort of a case study of how we came to
where we are. This project is ongoing. We haven't finished some of the
fainter parts of it. It's another year or so. But I'll end with the current
picture.
My collaborators the whole time have been Margaret Geller and Dan Fabricant.
In the beginning, John Huchra. Around the middle and up 'til now, Ian
Dell'Antonio of Brown University, and for the last few years, Satoshi
Miyazaki of the National Astronomical Observatory of Japan.
Well, all right, let's start with 1986. That's what redshift surveys looked
like. That's actually not true: these dots are colored by the galaxy types,
and that data didn't exist for another four years. I made this plot in 1991,
as soon as it was possible, for, I think, SIGGRAPH, but it's appeared in very
many places now.
But certainly, at that time, we were thinking about what to do next. Also, in
1986, that's a picture of the Arizona mirror lab, and it was being conceived
then. It actually had already been conceived, and discussions were started as
to what to do with it. In particular, their goal was always to build eight
meter mirrors. Should they build a smaller one on the way that would fit
inside the MMT building? And pretty soon, the answer was yes, and it should
have a wide field.
So starting about 1986, we, of course, thought about doing one of these big
strips deep with a six and a half meter telescope.
All right. Well, now this is 1987, a year later. This conference in Garching
is actually very interesting from the history of this field. I spoke there on
classification methods. And at this point in the talk, I'm going to stop and
give an ad for our sponsor anyway, my sponsor, which is ADS.
ADS started at this meeting. I gave a talk in which I showed, among other
things, that you can take digital spectra, treat them as vectors and do
principal components and get a classification space.
Anyway, somebody saw the talk and came up to me and said that, in fact, that's
how you classify text. And that's the start of ADS. ADS didn't happen for
another six years, but it was being worked on from about that time.
So the ADS ad is that ADS now has probably 3 million full text articles.
Every article really ever published in a major journal in physics and
astronomy is in ADS full text now, and people who are interested in data
mining should see us, should figure out how to use it. It's probably the best
collection of scientific full text that exists anywhere in the world. And
we're extremely happy to have collaborators. We're building the APIs now for
them and we need people to help us, tell us how to build it.
All right. Well, back to the talk. In Garching, that's one of the paragraphs
I actually wrote in the proposal, in the paper. I'm talking about, you know,
how you classify voids and bubbles, which had only just been found at that
time. Two years before, nobody would have said that, because nobody knew they
were there. Anyway, in the last sentence or so I talk about the microwave
background fluctuations, which hadn't been discovered yet, and whether or not
there's going to be some correlation between these bubbles and voids and
them. That's, of course, what the BAO scale is, and that's been found and is,
indeed, part of what we're doing with the survey that I'll end with, HectoMap.
Well, all right. I don't know whether you can see any part of that, but it
gets to where the problems are in doing this. Back then, there were not
galaxy catalogs, really. The first map was done from the Zwicky catalog.
Zwicky, of course, looked at the POSS plates by eye and wrote it all down.
This is POSS 1 digitized. Look here. And look here. Here, you can see
something. And here you can really see nothing. But I'll show those things
several more times in the talk. So if you sort of remember what that looks
like.
The thing to do at that time was to digitize these plates. I met George while
I was at [indiscernible] digitizing plates, if I remember, about that time.
And so we digitized them. We went out to the MMT and observed them.
This is the map we finally came up with. That's several years of what we
could get from MMT time to do that. That was finished in the mid '90s or
something. The Century Survey is what that's called. That's where the
original slice is. This goes out to a redshift of 0.15, I think, out there.
Anyhow, that's what you could do back then. And, of course, we knew we were
looking for something better. Well, now let's go forward to 1991, three more
years, four years. And it had already been decided that we were going to
build this telescope. The glass had already actually been bought from Ohara
for the mirror. The spectrograph had been sort of designed. This is where we
figured out the algorithm for how to put down the fibers.
It turns out that we discovered you don't need such a huge angular motion in
the fiber positioner, which made it much easier to actually build the fiber
positioner. So that's all algorithmic work that was done getting ready for the
thing.
I showed that plot at the IAU in 1991. This is the abstract of the talk I
gave, which was again on spectral classification of stars. I was still
considered Mr. Automatic Stellar Spectral Classification at that time,
although I hadn't worked on it in, by then, nine years.
And here were a couple of things. First, the spectrograph didn't have a name
yet, I think. We thought it would be ready in 1995, four years later. That
it's still going to be 300 fibers, that's true. We thought you could get a
thousand classification quality spectra in an hour for bright objects. It
turns out we never built the grating for this, so we never really did stars
with this instrument. The hypervelocity stars project we do, but very few
other star projects.
But anyway, individual researchers can get on the order of 100 thousand
spectra per year, which will change the field. And, in fact, individual
researchers, small groups, can get 20 or 30 thousand, and that's basically
what I'm going to show at the end of the talk.
All right. Well, let's go ahead now three or four more years, and if you'll
remember, we thought this would be done then. Of course, it was still four
years from being done, four years later. So I did this, which is to describe
the whole process of how these things are done. You start basically by having
some large body of knowledge, databases and surveys, and then you build on
it. Somebody has an idea, a new technique is developed. You use it. It
works out. You get more and better. You know, the known is that way and the
body of knowledge is this way.
Basically, it works like this. You've got a body of knowledge. Somebody
invents something, does it. It succeeds. They come back, they do more of it.
Somebody else does some. The guys who did it build on that. Other people
build on it. And eventually, somebody comes around and builds a survey that
completely closes out that observational space. There's no real reason to
observe that stuff again. It's done. Like, you know, basic galaxies with the
Sloan. Right now, who would ever observe them again?
Time series, other things, it's a whole different dimension. See, it's this
whole dimensional thing.
>>: Check the theory, basically.
>> Michael Kurtz: Yeah, right. I hadn't thought of that, but that's true.
Anyway, so this is how I viewed what would be true. The Sloan survey hadn't
taken its first photon yet. In fact, none of these things had. What I
thought would be true somewhere, you know, five years ago now, so ten years,
15 years in the future from when I did it: there were all these deep probes.
The Australians had one, 2dF probably should have been that, and that's now
WiggleZ or something. But those names didn't exist back then. The Hectospec
survey, we thought, would be bigger and more done than it actually turns out
to be. DEEP existed, et cetera. The guys at ESO have several surveys now,
none of which were even conceived at the time I built this diagram.
But anyway, all right, so that's sort of what we were doing then. And now
we'll get back to the problem at hand, which is how do you go and feed a six
and a half meter telescope sky survey for spectra? Well, this is now the
POSS 2. Yeah, this is probably unseeable, but you can see there and here,
they're the two regions I'm trying to have you look at. And if you could
actually see this, it looks a little better than it does on the POSS 1.
>>: White stars on black sky, you call yourself an astronomer?
>> Michael Kurtz: Well, when I made these slides, it was for a dark room.
Besides. Anyway, the nice thing about this is that somebody else had a
measuring engine and made a catalog. So we didn't really have to go and scan
them anymore ourselves. And because of that, and because we now had CCD
spectrographs instead of Reticons, a 60-inch telescope could make this
survey. This not only goes out to a redshift of a tenth, it's the same
region. What did we call this? 15R. It goes out to 15 in big R.
Well, all right, so that's what we sort of did. We're still looking to do
these slices. We're still thinking about how to do it when we have the six
and a half meter telescope. But it's still four years away.
So what do we do? Well, we use that data to simulate what we'd have when we
had the big telescope, of course. And one of the things you have to do is
figure out how to automate the reduction of the data, if you're going to have
so much data. And this is our blunder diagram. This is the actual data from
the survey I showed you, plus other regions of the sky.
And this is the blunder area. Every dot there is something that you would not
get automatically by the most stupid procedures. And, indeed, we had to fix
these. This turns out to be objects where nitrogen is brighter than H-alpha.
These up here, they can all be understood, but the algorithms had to be
changed so these things all went away.
That one was a difficult one. And this one turns out to be two different
galaxies on top of each other. It's real. I won't explain what this diagram
is, how exactly I got it, but the blunders are up here and you need to do
stuff like that.
Well, all right. Three years later, four years later, they still -- it
wasn't four years away then. It was only three years away. So what did we
do? We decided we'd model the sky subtraction and improve it. So we improved
the sky subtraction. This shows that for normal things you can get 80 percent
completeness three magnitudes below sky using just very simple techniques,
which is what we would actually have with the survey that we were planning.
Okay. Now, three years later, it was ready. Three years later, the telescope
was built. We were now in 2003. So we've gone forward in time now 15 years
from the beginning and eight years from when it was originally supposed to be
done.
Well, this -- that's the galaxy, those are the galaxies I was showing you,
and that's the galaxy that you could never have seen on the other ones. This
guy is bright enough to be in the survey, these guys all are, but that guy is
not. To show you what I'll eventually show you, so you see what it is.
But anyway, we never got it together to do the drift scans, to go and get our
own catalog to do this survey. Because it was so late, there were whole areas
of the space that were taken by other people. DEEP existed at that time. The
Sloan was already taking data at that time -- we'd have beaten them if we'd
been on time, but that's true with everybody.
So what we did is we got this as part of the Deep Lens Survey. This is five
by three arc minutes or so, the Deep Lens Survey. This region is four square
degrees, and we had competitors: we had Arizona and CfA, who were doing
something with the NOAO deep field. So basically, we didn't do the survey at
that time that we were planning.
But we did this instead. We did this because we were interested in weak
lensing, and looking at the weak lensing properties in the Deep Lens Survey
compared to the spectroscopy, deep spectroscopy.
All right, well, I won't show you any of the results, but this is a noise
diagram. This is a measure of signal to noise. And this is the brightness
of the galaxy down the fiber. It turns out that 21 down our fiber in big R
is almost exactly the same as 21 fiber magnitude for Sloan little r.
So at that level, in an hour, you basically get everything. One of the
things about this is that almost all these things down here were taken with
the moon up, and we were figuring out exactly how much moonlight we could
have and still do the survey.
The next thing that happened is the Sloan DR5 came out, which had this
region. What this meant for us is we didn't have to collaborate with the
people who were taking the images. We could just do spectra of regions that
we wanted to. Here, you'll see the little guys here. This guy is perfectly
visible in the actual Sloan. That guy is fainter than you're going to see.
That one's about 20.2 or 21.2 or 21.3. The Sloan is perfectly deep enough
for a six and a half meter telescope for almost everything you'd really want
to do. It's not deep enough for four, five hour integrations. But for our
integrations, everything is in the Sloan.
So this freed us from having to deal with collaborators who do the imaging.
Okay. So the Sloan had colors. So we did this. This basically is a
photometric redshift paper showing that you can do surface brightness versus
color. This is color percentiles as a function of surface brightness. These
are the red ones. Goes that way. That's the bluest ones. And it works pretty
well. You don't really need several colors. A couple just work fine. This is
just R and I versus R surface brightness. You get rid of most of the outliers
if you have G.
All right. Well, we finished that and thought of what to do next. The
lensing project worked pretty well. So this is a lensing map done by the
Japanese on the [indiscernible] telescope, and we figured we'd just observe
that region. We did. And we didn't have to ask them. We didn't have to do
anything. They published their results already. We knew where their peaks
were.
We, of course, made the mistake of asking them, and now we have more
collaborators, and I'll show you that in a moment. But the nice thing about
the Sloan is that we can just do this. None of this had to be arranged in
advance. We could call them up and say, we have a redshift survey over your
region, do you want to collaborate? Not: do you want to collaborate, give us
stuff. And that works a lot better.
Okay. So this is the survey itself. It's a film. This is the old CfA
redshift survey from 1986. It will disappear in a moment. This is 0.5 in
redshift. That's the Sloan. This is what we did with the Japanese field,
and now coming in -- this is sped up in time -- that's the first year of
observation after the Sloan. So 2009, I think. Then '10, then '11, then
'12.
And that's the final map, or will be very shortly. Here it comes. This is
amazing if the film runs to the end, which it doesn't in PowerPoint, which is
the new buzz word for the Smithsonian Institution. This film is actually
going to be shown at the Air and Space Museum real soon now. It may actually
already be, but we only made it last week.
Okay. So to conclude, that's Hyper Suprime-Cam, and it turns out Hyper
Suprime-Cam is now going to, as its engineering time, image the strip that I
just showed you. So things come around. Now, the guys doing the photometry
are doing the photometry in the region where the spectroscopy is already
done. So that's all I wanted to say. Thank you very much.
>>: Can the next speaker come up while we take a few questions?
>>: So what area do you actually cover and how many objects have you got
spectra --
>> Michael Kurtz: It depends on the deep or the shallow survey. The shallow
survey is 50 square degrees. The deep survey is 35. There are currently
about 60,000 spectra. We'll go up to about 75 or 80 thousand.
>>: How many bands?
>> Michael Kurtz: They're spectra.
>>: Oh, spectra.
>>: Is the data public?
>> Michael Kurtz: No way.
>>: Are you going to put it into --
>> Michael Kurtz: No.
>>: Some cloud --
>> Michael Kurtz: Absolutely not.
>>: Why not?
>> Michael Kurtz: Screw that. It's ours. I know that I'm making fun of
this, but, in fact, we're a small group and we really can't afford it.
>>: Okay.
>> Michael Kurtz: The first data is being, you know, managed now and
published now. But we don't have anybody to actually deal with it. The
person who, say, would get together a data paper is me. I'm busy; it's just
not on my list of priorities.
>>:
Have somebody write your paper for you.
>> Michael Kurtz: To some extent, I agree. But I'm not all the people in the
collaboration either. So of everybody, I'm the most public one. And it's not
going to happen.
>>:
Let's thank our speaker again.
>> Nicholas Ball: All right. I see we're running a little late, so I'll try
and finish relatively quickly. So I'm Nick. I'm from the Canadian Astronomy
Data Centre, now well known as the source of 39 million Sloan queries back in
2008.
>>: Was it you?
>> Nicholas Ball: All I can say is that it wasn't me. So these are my
collaborators: David Schade, head of CADC; Alex Gray, who will be here
tomorrow; Martin Hack, who is also involved in Skytree; and other people.
So just to give an idea who I am: you can divide people up into data miners
who do astronomy and astronomers who do data mining. I am an astronomer who
does data mining. So I will talk about CANFAR. Then I'll talk about
Skytree, then I'll talk about CANFAR and Skytree, give an example of using
it, an example of science, if there's time, and conclusions.
So CANFAR is CADC's cloud computing system. It's pretty unique in being a
cloud computing system for astronomy. The idea is that it provides a generic
infrastructure for processing of, for example, survey data, and it also
provides storage.
The basic specs are there: 500 processor cores, up to six processors and 32
gigabytes of memory per node, and that's going to increase to 256 soon.
Storage is provided by the VOSpace system, and there are several hundred
terabytes of storage available. A key point is that this storage can be
accessed as a mounted file system, so it's just the same as having another
directory on your machine.
When a user uses CANFAR, they see a virtual machine, and you just see it as
another terminal. So you SSH to CANFAR. Once you're there, you can install
and run any software that runs on the Linux system, and that's essentially
all astronomy code. And that's a very important thing. CANFAR was set up so
that astronomers can run their own code, which a lot of astronomers like to
do.
And once you've set up your virtual machine, you can then run that
interactively if you'd like, or you can run it in batch via Condor, and batch
is good because that gives you access to 500 cores. So that's CANFAR in a
nutshell.
Skytree is a software system that is designed to be the first large scale
machine learning software that is industrial strength, essentially. Their
philosophy is, rather than try to implement thousands of data mining
algorithms, they implement seven well known algorithms and they do them well.
And a lot of data mining algorithms can be reduced to these. Maybe reduced
is the wrong word, but there are aspects of these algorithms in many others.
The key thing about Skytree is that it's fast. The implementation is always
scalable, so things that are naively N-squared, for example nearest
neighbors, scale linearly. The system is robust and there are published
papers showing the accuracy and the speed-ups of the algorithms. This comes
out of the FASTlab, of which Alex is also head, which is academic. The key
point is that it allows for publication quality scientific results.
So as I say, it has an academic and astronomy background. In a sense of an
astronomy background, the sky in Skytree stands for the sky, and tree stands
for the data structure.
It's designed to work on the command line, and it's designed to be called as a
part of a more general analysis. So it is well suited to CANFAR, because
CANFAR is a batch processing command line system.
And you input for example, ASCII data, and you visualize the results elsewhere.
>>: Is the primary a library that you call from C code?
>> Nicholas Ball: Yeah, it's a machine learning server. So yeah, you call it
from whatever code, or you can just run it interactively from the command line
itself. So it's designed to provide -- it's essentially, you can almost think
of it as like Unix commands, except now you have machine learning commands
instead. So it's designed to be part of your analysis.
So these are the algorithms that it implements. I'll just do this quickly.
AllkNN is all nearest neighbors, and it goes from N squared to N. It does
kernel density estimation with the same speed-up; this is N objects.
Two-point correlation function, again. And then the speed-ups here are more
complicated, but you could have things like D-cubed to D, where D is the
number of dimensions, and support vector machine is more complicated. But it
implements SVM, linear regression, singular value decomposition, which
includes PCA, so it's dimension reduction, and also K-means clustering.
So CANFAR plus Skytree: this is a powerful system. CANFAR on its own is
powerful. It's like having a supercomputer. Skytree is also powerful. And
when you combine the two, you get a good system to work with.
As I said, you can install your own code on the virtual machine, and you have
access to VOSpace as a mounted system. So this essentially follows the
paradigm of taking the analysis to the data, because the analysis is in
CANFAR, which is in CADC, mostly. And the data is on VOSpace.
So the argument for having Skytree in this way is analogous to the argument for CANFAR itself. CANFAR itself is a generic infrastructure that saves you having to reinvent the wheel to do your own data analysis. And Skytree does the same thing with respect to performing data mining on large data.
So I'll just show you quickly, as an example, say you want to actually use it, what do you do? To start with, you request a CANFAR account. You also get a CADC account, which is trivial; it's just like any other website, login and password. Then once you have your account, you SSH over to CANFAR to create a virtual machine. You VM-create the virtual machine, then you go through it, and then say you want to install some software. If you want to install Skytree, you just download it and untar it. And the only other thing you have to do to run Skytree is set up a license server, so that you can then run it in batch in multiple instances.
A typical call to run Skytree looks like the following. It's just a command line with arguments. So here, we're running the nearest neighbors. This is the file that's going in. It's an ASCII file, and it's slightly modified into the Skytree format. But that is all very simple. And if there isn't an existing script to convert your data type, then they'll help you write one. And then there are just some arguments here and then some outputs. And in this case, it's outputting the distances to the neighbors.
Then you do whatever with your results. And you can run this entirely interactively on your single virtual machine, or you can run it in batch, up to 500 at once.
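[The general shape of such a run, driven from a script; every name and flag below is an invented placeholder, since Skytree's real command-line syntax is not reproduced here.]

    # Hypothetical sketch of wrapping a command-line call in Python.
    # "skytree-server", "allkn" and all flags are placeholders, not the
    # tool's actual syntax.
    import subprocess

    cmd = [
        "./skytree-server", "allkn",   # placeholder binary and task name
        "--input", "galaxies.st",      # ASCII table in the tool's format
        "--k", "10",                   # number of neighbors
        "--output", "neighbors.out",   # distances to the neighbors
    ]
    subprocess.run(cmd, check=True)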
So a key part of this is that you don't just work with a virtual machine. You
can also work with us, the people. The aim of the system is to enable better
science. If you have a problem and you maybe want to see how to solve it with
data mining, if you can, then we'd be happy to work with you to do that.
And my background is astronomy, so if you send me an astronomy email, I'll probably understand it. And I can help suggest data mining. A key point, which I should have put on here as well really, is that if you maybe need something more advanced in terms of algorithms, then Skytree and the FASTlab are world experts in machine learning. So if I don't know the answer, maybe they will, and they can be part of a consultation too, because both the FASTlab and Skytree are very keen to work with astronomers. And as I say, people like Alex have been working with astronomers for 20 years now.
So it looks like I have time to do this quickly. So my science interest is galaxy luminosity function, large scale structure as well. An interest in this is if you look, for example, in Eric's SCMA book over there, the first chapter is new estimators for the galaxy luminosity function, so there's a whole lot of astrostatistics interest in this as well. And I think there's a potential there.
But in this case, if you want to do the luminosity function, you can consider doing Photo-Zs, and if you make Photo-Zs, they have great legacy value, so that's what we've been doing. Skytree nearest neighbors allows you to produce Photo-Z PDFs, and that's detailed, not with Skytree but with the same method, in this paper. And because CANFAR lets you run any software, you can run the template-based codes on 500 cores as well. Like many things, the consensus is that Photo-Z is best done using more than one method and comparing them. So that's what we're doing.
And we did this; we ran the neighbors for the CFHTLS survey, 130 million galaxies. And the way we generate PDFs means we generated a hundred instances of each galaxy, and we did that because it enables us to demonstrate creating a catalog of billions of objects, storing and handling that catalog, and then performing data mining on that catalog. In this case, nearest neighbors to get Photo-Zs.
So we have processed an LSST-sized dataset. And that's no small thing, because you're sitting there with hundreds of cores, virtual machines, VOSpace, and it all actually works. So it's quite nice, and that's just an example of a Photo-Z PDF, showing that it's general; it's not Gaussian or anything.
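[A toy sketch of the perturb-and-reuse idea just described: each galaxy's magnitudes are resampled a hundred times within their errors, and the nearest training neighbor's spectroscopic redshift is collected for each instance. The arrays and error model are assumptions for illustration; this is generic Python, not the production code.]

    # Toy nearest-neighbor photo-z PDF for one galaxy.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    train_mags = rng.random((50_000, 5)) * 5 + 18   # stand-in training set
    train_z = rng.random(50_000) * 2                # known redshifts
    target_mags = np.array([20.1, 19.8, 19.5, 19.3, 19.2])
    target_errs = np.array([0.05, 0.04, 0.04, 0.06, 0.08])

    tree = cKDTree(train_mags)
    instances = target_mags + rng.normal(0.0, target_errs, size=(100, 5))
    _, idx = tree.query(instances, k=1)
    pdf, edges = np.histogram(train_z[idx], bins=50, range=(0, 2),
                              density=True)
    # "pdf" is a (generally non-Gaussian) photo-z PDF for this galaxy.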
So another thing we wanted to show was that Skytree scales, as they claim, on real data. You really get N-squared to N. And another thing, this is in
progress, but we also want to compare it to Open Source alternatives, for
example, R. So we're looking at running various algorithms on various large
catalogs and that's in progress. You can tell we're Canadian here, catalogs.
So, for example, those, and we want to do -- you can do useful things with that. For example, finding outliers. So I could add to this list now, of course, EMU, which we heard about earlier. 70 million. That's the same sort of size when it's done.
So a couple of quick plots just to show what's in progress. This is run time versus fraction of a dataset, the dataset being 500 million objects. And you can see that for the points we have, the run time is, indeed, linear. We actually produced error bars on these points, but the error bars were smaller than the points. So that's that. And then the next plot shows that at the moment, just doing this is memory limited. But once we have the 2.56 gigabytes, you can see we can clearly run this sort of size dataset and run neighbors or whatever on it, and so what this has done is it's found the nearest neighbor to every object in the dataset.
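[The check being plotted can be sketched like this: time the neighbor search on growing fractions of a dataset and look for linear growth. A generic illustration, not the actual benchmark.]

    # Time nearest-neighbor runs on fractions of a dataset.
    import time
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(2)
    data = rng.random((1_000_000, 3))

    for frac in (0.25, 0.5, 0.75, 1.0):
        subset = data[: int(frac * len(data))]
        t0 = time.perf_counter()
        tree = cKDTree(subset)
        tree.query(subset, k=2)   # nearest neighbor of every object
        print(f"N = {len(subset):>9,d}  t = {time.perf_counter() - t0:.2f} s")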
So to conclude, CANFAR allows storage, processing, and analysis. It's generic,
and you can run your own code. Skytree is fast and robust, and it will allow
publication quality results. CANFAR plus Skytree implements Skytree on up to
500 cores. You can combine this analysis with your own code, and you can store
your data in VOSpace. And that's persistent storage. You can put it there and
you can even have proprietary data there as well with permissions.
So if you're interested, there are lots of ways to get started. Email CANFARhelp, talk to me, sign up by the poster, look at my website, whatever. If you want to look at the website, I designed this page to be -- there's a single go-to CANFAR-for-Skytree page that has all the information and all the relevant links on it. So if you're interested, that is the best place to go. Or just have a look at the poster.
And we are very much encouraging anybody who is at all interested in this to
have a go with it and use it, because it becomes much easier to justify funding
it if there's a base of users. And on a more personal note, thanks.
>>: And I might add, the only speaker to stick to time.
>>: Sounds like an outfit too good to be true. You have storage space, computer resources, algorithms and expertise so you can use it. So essentially, like in no time flat you're going to be besieged with requests. How are you going to manage resource allocation when that happens?
>> Nicholas Ball: Well, for a start, it will be a nice problem to have. So far, we don't have many users. In terms of managing the resource allocation, well, jobs are queued by the Condor system, so they're queued in a fair way, the same way as on supercomputing systems.
If we become inundated by users, then it should be much easier to justify
funding to expand the system, because the processing is already distributed
over more than one site, as is the storage, so the whole system is expandable
well beyond 500 cores. And if there's a clear demand for it, then expanding it
should be possible but I don't know the details of that.
>>: Can I start running another Millennium Simulation there?
>> Nicholas Ball: Yeah, in principle. So the way it works is that projects
with Canadians are guaranteed access because of the way it's funded. Projects
with no Canadians are done on a case-by-case basis. So far, we haven't said no
to any astronomers. So if anybody here is interested, I don't see a problem.
>>: You should advertise it on maybe the Facebook pages for astronomers or astroinformatics.
>>: I recommend he not advertise it.
>> Nicholas Ball: I want to do whatever it takes to get a base of users.
>>: Are you planning on adding other techniques, besides the ones you listed?
For example, for nearest neighbor, something like [indiscernible] type things
that have variable numbers of neighbors and so on?
>> Nicholas Ball: Yes. So there are a few things listed on the poster that Skytree is hoping to put out later this year. Things like decision trees, probably neural nets. Time series streaming is something they're working on, and several other things.
They basically say that if there's an algorithm that has a lot of demand, then they'll add it. Things like the neighbors already have a few refinements, like different distance measures, I believe. I don't know if they have [indiscernible]; I don't think so. But if it's something that a user is interested in, they would be very amenable to adding that.
>>: And if I have my home brew stuff I want to do with the output in MATLAB,
would that work? Would you support MATLAB?
>> Nicholas Ball: If MATLAB requires a license, I don't think we could pay for a MATLAB license. But if you could -- I guess if you could pay for the license, then you could install it on CANFAR and run it.
>>: The license is beyond control.
>> Nicholas Ball: That was actually a significant thing here, the fact that
Skytree is licensed as well, and it just seems to work fine on 500 cores.
>>: About the [indiscernible] data. Skytree has its own format. A couple of related questions. One is whether Skytree would be [indiscernible] so that it would interoperate with other VO service providers. But if you have a hundred million object catalog, you don't want to upload it to a VO service, so how do you get large datasets into the system? Have you thought about caching those datasets so others could simply access them from your data stream?
>> Nicholas Ball: Yes, so the way I've done it is basically with CSV files and the Skytree format. For some of the algorithms you can put the CSV straight in, because they don't need any more description. For some of them it's just headers saying this column is a double, this column is a whatever, and it's still ASCII, so it's relatively simple. You could even do it with an awk command, almost.
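[A sketch of how light that conversion can be; the typed-header layout below is a made-up placeholder, not Skytree's real format.]

    # Prepend an invented typed header to a plain CSV file.
    import csv

    types = ["double", "double", "double", "int"]   # one entry per column

    with open("catalog.csv") as src, open("catalog.st", "w") as dst:
        reader = csv.reader(src)
        names = next(reader)                 # original column names
        for name, t in zip(names, types):
            dst.write(f"# {name} {t}\n")     # hypothetical header lines
        for row in reader:
            dst.write(",".join(row) + "\n")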
But yeah, in terms of caching data, you could certainly store the data on VOSpace and allow whichever people access to it. I don't know if there's a faster system than just the disk that the files are stored on in VOSpace, so you'd be accessing it off a disk, but that's fundamental to the CANFAR system. But yeah, you should be able to create whatever data and allow people to use it.
>>: [inaudible].
>> Nicholas Ball: If there was interest in being able to read VOTable, yes, I would have thought they'd be willing to do that.
>>: [indiscernible]. Moving the algorithms to the data, because they are
still moving the data to [indiscernible].
>> Nicholas Ball: Yeah.
>>: And not moving [indiscernible].
>> Nicholas Ball: Yes, your system is --
>>: This is a simple question. When you say that if the user wants to solve a specific problem he can contact CANFAR or me, does it mean that [indiscernible] help only if it means you [indiscernible] on the paper? Because it always happens in this case.
>> Nicholas Ball: So I don't really have many examples to go on so far, but yeah, it would be -- I would be some kind of -- I would be a consultant, essentially. So I would assume whether I would ask to go on the paper depends on how much I contribute.
>>: Same problem [indiscernible] find a solution.
>> Nicholas Ball: Yeah. So yeah, it's a sort of -- this is -- my trying to
sell myself a bit is kind of related to this, right.
>>: But it's a standard problem, right. That part of doing all this is to
parade your skills to the community. Maybe you can contribute to lots of
different projects.
>>: [inaudible] for instance the wonderful work done by [indiscernible] has begun to [indiscernible]; people are beginning to note that statistical information, data mining, is the same. The only [indiscernible] is that people want to use these things without even making the effort to understand what a neural network is or what a decision tree is. So at the end, they come up with the most absurd questions, and they are upset if you don't give an immediate answer. You say, look, if you don't even know the basics, I cannot even start the [indiscernible] for you.
So this is what I'm asking: the [indiscernible] must be very careful; otherwise, they would be submerged in people's questions.
>>: Become a help desk.
>>: They do not expect to even give you anything in exchange.
>> Nicholas Ball: Yeah, I mean, so far we're just trying to get people to know it's there.
>>: There was a paper on astro-ph a month or two ago. I saw one on CANFAR.
>> Matthew Graham: -- we've been doing at Caltech, which is related to some of the other stuff that's been talked about today, which is we've been playing around with looking at automatic discovery of relationships in astronomical data using a tool that's come out of Cornell, [indiscernible] group at Cornell, called Eureqa, or it used to be called Eureqa. It was the subject of a Science paper in 2009. It's now called Formulize. [indiscernible] is part of the same group, but doing something different.
But the basic idea is the use of the technique of symbolic regression to
determine best fitting functional forms to data, and it does this -- it does a
combined search for both the best form of an equation and then the parameters
of that equation simultaneously to fit the data.
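[A toy sketch of that idea: an evolutionary search over random expression trees, scoring each candidate by its fit to the data. Entirely illustrative; this is not Eureqa or Formulize code, just the general technique in miniature.]

    # Miniature symbolic regression by evolutionary search.
    import math, random

    random.seed(0)
    OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b}
    FUNCS = {'sin': math.sin, 'cos': math.cos}

    def random_expr(depth=3):
        if depth == 0 or random.random() < 0.3:          # leaf node
            return 'x' if random.random() < 0.5 else random.uniform(-2, 2)
        if random.random() < 0.7:                        # binary operator
            op = random.choice(list(OPS))
            return (op, random_expr(depth - 1), random_expr(depth - 1))
        return (random.choice(list(FUNCS)), random_expr(depth - 1))

    def evaluate(e, x):
        if e == 'x':
            return x
        if isinstance(e, float):
            return e
        if e[0] in OPS:
            return OPS[e[0]](evaluate(e[1], x), evaluate(e[2], x))
        return FUNCS[e[0]](evaluate(e[1], x))

    def mutate(e, p=0.2):                # replace random subtrees
        if random.random() < p:
            return random_expr(2)
        if isinstance(e, tuple):
            return tuple([e[0]] + [mutate(c, p) for c in e[1:]])
        return e

    def mse(e, xs, ys):
        try:
            total = sum((evaluate(e, x) - y) ** 2 for x, y in zip(xs, ys))
        except (OverflowError, ValueError):
            return float('inf')
        return total / len(xs) if math.isfinite(total) else float('inf')

    xs = [i * 0.1 for i in range(100)]           # toy data:
    ys = [math.sin(x) + 0.5 * x for x in xs]     # y = sin(x) + 0.5*x

    pop = [random_expr() for _ in range(300)]
    for _ in range(40):                          # crude evolution
        pop.sort(key=lambda e: mse(e, xs, ys))
        elite = pop[:30]
        pop = elite + [mutate(random.choice(elite)) for _ in range(270)]
    best = min(pop, key=lambda e: mse(e, xs, ys))
    print(best, mse(best, xs, ys))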
Now, you can specify the type of building block you want to use in the fit in terms of mathematical building blocks and algebraic operators and other functions, [indiscernible], that sort of thing. And then there are more sort of advanced building blocks that you can use.
And what it actually does is it uses an evolutionary algorithm to explore a
metric space of numerical partial derivatives of your dataset or actually
variables in your datasets and it's looking for invariants. And the idea is it
runs through -- it's a genetic algorithm, an evolutionary algorithm, and it
produces a final small set of candidate analytical expressions which are the
best fit according to some metric that you've supplied, whether it's goodness
of fit or absolute error or something. And it supplies a set of final candidate expressions which vary from, you know, this one uses a small number of parameters and so is less complex, but maybe doesn't have as good accuracy, to something which is more complex, because it may use more complicated functions or a higher number of functions, and is more accurate, on this Pareto front of accuracy versus parsimony.
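[Picking that Pareto front out of a set of scored candidates is itself only a few lines; a generic sketch, not Eureqa's implementation:]

    # Keep the Pareto front of (complexity, error) pairs: candidates that
    # no other candidate beats on both axes at once.
    def pareto_front(candidates):
        """candidates: iterable of (complexity, error, expression)."""
        front = []
        for c, e, expr in sorted(candidates):   # ascending complexity
            if not front or e < front[-1][1]:   # must improve on error
                front.append((c, e, expr))
        return front

    cands = [(3, 0.90, "a*x"), (5, 0.40, "a*x + b"), (5, 0.55, "a*sin(x)"),
             (9, 0.38, "a*sin(x) + b*x"), (12, 0.10, "a*sin(b*x) + c*x")]
    print(pareto_front(cands))   # simplest candidate at each accuracy level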
So go to this slide, if you're interested in the software. Unfortunately
they've just changed it so that it's licensed above a basic version. But they
seem to be quite amenable to discussion.
So the sort of thing we've been playing around with: first of all, can we reconstruct relationships in the Hertzsprung-Russell diagram if we just put in our dataset absolute magnitude versus color, surface gravity, temperature and metallicity, as you might find from a spectroscopic sample of stars or whatever? Can we construct relationships?
What we find is if we train it up on the blue data points, which are around about three and a half thousand stars from [indiscernible] which give a good coverage of the HR space, and then we apply it to some other data, we get some very good solutions on them, where we get a median difference between the [indiscernible] and our [indiscernible] value of about 0.6 in MV. We get similar errors when we apply it to SEGUE data, which is the red stuff, and the black points are from the RAVE DR3 dataset.
So we're not overfitting the data in any particular way. It seems to be
finding realistic relationships in the data in terms of fitting functions to
the data.
Similarly, the right-hand side is a plot of the fundamental plane of elliptical galaxies. This is the original data, [indiscernible] '87. And again, you put in the data -- you know, these are the values; can you find any relationship in the data? So it goes ahead and does that and reproduces the values from that.
Possibly the more interesting one is that you can phrase the search for a
fitting function in such a way that it actually becomes a binary classification
operation.
So these are the RR Lyrae and W UMa data sets that Sherri was showing earlier. Obviously, from the light curves you can't tell anything different between them. So what we do is characterize them in terms of about 60 different features, and then that forms the dataset that we put into Eureqa, and it goes through and comes up and says the best fitting function which allows you to differentiate between these involves just the period and, in our case, the median absolute deviation, and it ignores the other 58.
So it's doing feature extraction as well. It's not using all of the 60 features. It only picks those which are absolutely relevant. And when you plot it, you can clearly see that there is very clean separation in terms of these, and that's because this is a section [indiscernible] diagrams of these two classes of object and there's a clear separation.
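[For reference, the surviving second feature is one line to compute; a generic sketch of the usual definition of the median absolute deviation:]

    # Median absolute deviation of a light curve: median of |x - median(x)|.
    import numpy as np

    def mad(flux):
        flux = np.asarray(flux, dtype=float)
        return np.median(np.abs(flux - np.median(flux)))

    print(mad([1.0, 1.2, 0.9, 1.1, 5.0]))   # robust to the outlying point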
So in these cases, we're getting purity and efficiency equivalent to the
decision tree stuff that Sherri was showing. Similarly, when we look and see
these blazars and [indiscernible] against [indiscernible], we get similar
results as well.
So it's interesting that this technique, which was originally designed for sort of looking for arbitrary relationships in the function space, can then be used for doing feature selection, feature extraction for purposes of binary classification.
So there should hopefully be a paper out on this soon. That was my quick five minutes.
>>: Is the complexity penalty more than just [indiscernible], or is it more something like a Bayesian?
>> Matthew Graham: I don't think so. The complexity factor does have some
effect on which solutions are selected, because you can weight different
functions with a different complexity value. And that does fall in, in some
ways.
>>: There's no penalty on complexity?
>> Matthew Graham: There's no penalty on complexity, right. You can impose an
arbitrary penalty if you want, but you don't want [indiscernible] particular
complexity.
>>: [indiscernible] but going to a search space, did you find the complexity
of the relationship [indiscernible]. The fitness function doesn't have that?
>> Matthew Graham: No, the fitness function doesn't, but you could filter it on --
>>: You had a bullet about Pareto parsimony. Isn't that just another word for complexity?
>> Matthew Graham: Sean could probably answer that.
>>: All right. I'll start. Apparently, this is not working. We're supposed
to discuss here sharing of knowledge discovery tools, and I'd like to show the
slide here to show that at the moment, this responsibility of sharing is not a
gentle, kindly thing where more or less equal friends share things that are
more or less equal. It is extremely unequal.
The people in this room know the meaning and use, often, the meaning of what's
on the left hand column. And Google shows that there are millions, or hundreds
of thousands of activities in here, in the world, in the industrial world, the
statistics world, engineering world.
And the right-hand side column shows that the astronomers are roughly five
orders of magnitude behind, okay. Now, there are not as many astronomers. We
don't expect hundreds of thousands of hits. But it would be nice to see
hundreds of hits or even thousands of hits on occasion. You have to remember
that astronomers --
>>: Those numbers, they're not right.
>>: I got them last year from ADS using --
>>: No, no, I can give you more data.
>>: Very good.
>>: Data mining has 3,000 hits.
>>: Sorry?
>>: Data mining has 3,000 hits in ADS as a phrase. If you don't make it a phrase, it probably works.
>>: It's an abstract.
>>: If you're looking at abstracts, I just did it. The number is 2,944.
>>: Something was wrong, I do apologize. It seemed to have been rather poor,
and I'm glad to hear that it's not as poor. Thanks very much. The results of
this, in my opinion, is that -- try random forest.
>>: Do you want it as a phrase?
>>: I only do refereed papers.
>>: All right. So random forest, refereed papers, as a phrase.
>>: Okay. So it's my impression that it's rather low. And I think what this
means is that we have more than sharing to do. We have a great deal of
education to do and a great deal of promulgation to do.
Now in statistics, astrostatistics, astroinformatics or sort of sister-related fields, the situation is more or less the same. The median astronomy paper in statistics uses a very narrow suite of methods, mostly 19th century and early 20th century methods, and often uses them poorly and doesn't know the latest methods from the 21st century.
>>: 76.
>>: 76, very good. Okay. I did do this a year and a half ago. So it might
be that things have improved. So this could be killed. You can just -- so my
point is that when we use the word sharing, we have a huge amount of sharing to
do as a community of experts. I'm not sure this is right, but I sort of think
that I'm sort of an x-ray astronomer in the other half of my research life, and
I don't have to worry that much in x-ray astronomy about sharing the results.
I go and I use a telescope. I do a study. I publish it, it's part of a normal
science progress and other people in my specialty learn about it. It gets
assimilated in some review articles after a few years. And I don't worry that
much about promulgation. Publication is a natural and almost complete
promulgation process. But in statistics and informatics, I think that isn't
true or is less true.
First of all, we tend not to publish that much. One reason why there's a new journal being formed over here is we're not publishing very much in refereed journals. But it's even more than that. When someone develops a method and
someone like Nick just described to us an incredible service, it needs to be
used by other people to obtain and achieve its full meaning.
So we can write a paper on the data mining method, but what we really want is
for hundreds and thousands of people and studies to use the methods.
So it's very hard to do this. I personally succeeded once: in 1984 I wrote a paper, published in 1985, taking a modern field of statistics that astronomy had more or less never heard of and translating it. It was called survival analysis for non-detections. And I wrote a code, and it's been used about a thousand times. And I consider that the biggest success of my career.
And Jeff Scargle has had this success more than once. Certainly the Lomb-Scargle periodogram is up there. I believe it's the highest cited paper in astrostatistics. And you've got a couple of others as well.
So it's not as though it can't be done, but it's done rarely, and I think
there's so much going on in data mining in this room and in methodology that we
really have to emphasize education and promulgation. That's my point.
>>: I'm not sure I can speak as eloquently as Eric, but I was making --
>>: No one can.
>>: I was making notes beforehand, and I think -- I almost wish, you know, you could ask me a technical problem question, as opposed to what, I agree with Eric, is almost a sociological one.
>>: What's the value of [inaudible]?
>>: Oh, 42. Plus or minus. But one of the notes that I wrote down going into
this was industrialization. And what we just heard from Eric was an example of
that, of I go to conferences, workshops like this, and I've gone to previous
ones in which you hear about these great methods, but then actually getting
somebody else to use it or producing the software in a form, making it
available in a way that somebody else can actually use it without having
essentially to reconstruct everything that you did in writing it, I guess
that's partly a technical issue, but really, it's clearly a very tough problem, because we don't have that kind of dissemination of lots of software codes in the community.
And I, again, I wish it was simply a technical issue of, well, if we just had
another website, or if we just, you know, had another something technically,
we'd solve it. It's clearly not. It's along the lines of the sociological,
getting something into journals in a way that connects. And it may be that
part of it is connecting in a way that is doing it with the science questions.
And only when codes are used for a couple of different science questions, you
know, period -- spectral analysis with a Lomb-Scargle applies in a couple of
different areas of subfields. Survival Analysis is something that applies to a
couple different areas and people can then make the leap of oh, yeah, I'm not
interested in galaxies, I'm interested in stars, but I can see how that would
apply to my work. But I wish it was simply a technical problem.
>>: The other is when they're adopted by large teams, the teams are getting
larger and larger. And when there's a concerted effort to use a particular
code because it's advantageous for some reason, people just jump on board that
naturally, because that's what's supported by --
>>: Whoever put your [indiscernible] on the algorithm used by Sloan to measure magnitudes? No. You use a [indiscernible], because that [indiscernible] was from the beginning from the journal of the public. The algorithms were thought for Sloan and they stayed with Sloan. So the large teams actually do exactly the opposite.
>>: I don't think so.
>>: Yeah, but only [indiscernible].
>>: Let's not confuse instrument type with data analysis.
>>: Yeah. I will be careful with that.
>>: But I guess the point is it's the hook. If you can find a hook somewhere
where there are lots of people who are going to follow, because that's the
easiest thing to do, then you can get adoption if your code is useful and easy.
>>: Let me just add a few things to what was said. One thing is I agree with 95 percent of what Eric says. Besides the numbers, I think his analysis was absolutely correct. One thing, which I learned from experience: if you want to obtain good results from data mining methods, you need to finely tune the data mining methods. Therefore, you need [indiscernible] which are not in the background of the regular astronomer.
On the other hand, you cannot [indiscernible] that someone else is solving the problem for you unless you are in a large team, where you do something we were discussing before. There are several reasons for which I think this is one of the notable points. Because if you don't take the optimal model, you don't get optimal results, and therefore instead of doing a good service to data mining, you do a bad service to data mining.
There is an entire community which works in data mining, like there's an entire community which works in statistics. And in the basic case, the root mean square astronomer can interact with four or five methods. And, in fact, this is what you see, for instance, in CANFAR. You have seven. You have just the tip of the iceberg of data mining methods, which are on average [indiscernible] but are not the optimal methods for the problem.
So it's difficult. Sometimes I have solutions; I have more problems than solutions. Maybe it simply is that the real problem is a cultural problem. In the past, we were using statistics. Then we have gone to different [indiscernible] because the number of objects which we are using is much higher.
So we're using statistics in a different way. Data mining is a different form
of statistics where you can define [indiscernible] for very, very large data
sets, basically where you want to obtain another type of information from your
data. It's something which needs to be done.
There is no doubt about this. You can like it, you can dislike it, you can be more or less mathematically prone. But data mining is something which sooner or later needs to be done. And it needs to be done by a generation formed in a different way. It must belong to the genetic background of future astronomy. [indiscernible] produce this genetic mutation.
>>: Beneficial one.
>>: Yeah, of course.
>>: Can I say more or less the same thing, but in a slightly different way. The problem of the lack of uptake of these modern methods by the astronomical community has worried me for a long time, and I think there are two parts to it. There is motivation and there is implementation.
The motivation part boils down to resources and results. And, of course, you have people using them to get results. So there is [indiscernible] there. The implementation, I think, is two things: education and usability. Education is a huge problem. [indiscernible] correctly says we do not yet have a widespread, high quality curriculum to introduce these methods for science in the 21st century, for the simple reason that professors are ignorant of them. But that has to be somehow evolved out of, and we can discuss on Thursday some possible ways to go about it.
But there is also the usability issue. We did look, of course, at a lot of different data mining packages and [indiscernible] virtual observatory. The problem is, if you take a piece of commercial software, say, Microsoft Office, somebody shows you what to do in PowerPoint, and in five minutes you can start doing stuff. Very quickly, you're kind of an expert.
It would sure be nice to have at least an introductory data mining version of that on every astronomer's desktop and laptop. Now, [indiscernible] is right: to do this really well, you need to know the stuff. But I think there is a whole grade from, you know, a simple clustering method, which is displayed and looks right or whatever, to really getting into the guts of the statistics and whatnot.
So that is a really big issue. You look at something like [indiscernible], which is a popular data mining package. A statistics manual comes with it. Nobody wants to study this. You know, people don't read one page, you know, read the [indiscernible]. So usability, packaging things in a good way, is I think a nontrivial issue, an implementation issue. And maybe we need to work with professional software architects or display architects.
>>: Can we just go [indiscernible].
>>: I actually have a very specific answer, a suggestion, which you probably don't know about. There's a CRAN package in R called rattle. Am I right? Rattle? Yeah? So rattle is a GUI for 40 other CRAN packages for data mining. And it essentially makes a PC pull-down menu out of R, which is otherwise a command line situation. And what would you say? Have you tried it? It's what I'd call not bad. It's not as good as Weka. There's RWeka, by the way. But rattle is a mechanism for teaching which no one's taken advantage of yet.
>>: For teaching, it's good. But for implementation, I have my doubts.
>>: It's not made for -- thank you.
>>: So, the perspective from someone who is not an astronomer listening today: I guess the question I have is to what extent are people thinking of these problems as astronomical problems, versus general problems?
I know one of the fields that I work in, [indiscernible] science, suffered for many years because it kind of kept moving away from everything else that was dealing with the same kinds of analogous processes. That is, we run into a problem in these fields of getting somewhat narrow. So we end up sort of talking only to each other.
So what you find in statistics generally is, if you move to the softer side of the sciences, the stats get better, okay. Generally, the sociologists are better statisticians than the physicists, right?
>>: Sorry, I have a problem. I've been working with sociology.
>>: So you know, and the geographers who do spatial stats are very sophisticated. So one of the issues here is how to tap into that market. You know, obviously, R was not written by a bunch of astronomers, right? And things in the MATLAB file exchange are -- in the previous talk, there's a whole field of sort of shape language models that looks at kind of fitting equations to data where you think you know something about the underlying process. So is this a problem that -- are we too [indiscernible]? And I think this is not just a criticism of astronomers.
>>: You know, when you have no choice, that's when a lot of adoption happens. And maybe this tsunami that's been predicted for how many years now is going to impose on people a requirement to use --
>>: Sorry. I was just going to say, from my own background in radio astronomy,
the VLA, which is one of the most productive telescopes ever, operated nothing -- or before it became the JVLA, it operated nothing at all like what it was designed to operate like. And it wasn't until people actually started getting the data that they then said, oh, there are other ways to look at it.
I very much think it was a case of it wasn't until there was a necessity that
people were forced to analyze the data. It's nice to think about what do I do
with a hundred billion record catalog or something. And then when I'm forced
to look at a hundred billion record catalog, I suddenly say oh, okay, I better
figure this out.
>>: In fact, one of the things I liked best in these days was the idea to join forces with the other [indiscernible] sciences. I mean, there's so much in common in the problems. And so much in common in the solutions. And other fields are much ahead of us. For instance, the platform that I'm having problems launching in the astronomical community is used quite a bit by informaticians. And actually not only by informaticians: medical doctors use it for diagnosis, for problems.
Recently, there was a paper about the treatment of diabetic patients which was completely done with [indiscernible]. So I think the problem is not only how to bring [indiscernible] to this community, but also how to build a common forum.
>>: Which is what E-science is all about.
>>: Exactly. But a real implementation. I don't want to waste ten years of my life doing [indiscernible] just for funding reasons if this is already available somewhere. I can -- I'm smart enough to think that I can invent something different to be found. The problem is that there is a complete lack of communication. Because the other communities do not communicate with us. We do not communicate with our own community, because this community now is [indiscernible] to this mechanism, or this huge number of services which just find themselves. The astronomers are only doing service, not anymore. This is my personal opinion. I can be wrong.
But they want easy tools to use, with which they don't have to think, in order to get their results. So it's like the theater, in my opinion. It's something from which we have to get out.
>>: Is it a romance or a tragedy?
>>: We don't know.
>>: So I guess it's worth mentioning that nothing in either CANFAR or Skytree is astronomy specific. So in principle, the only astronomy-specific part of it is that that's who [indiscernible]. So there's nothing stopping anybody else who has data who wants to use the system.
And as far as I'm aware, there are no analogous things to that going on in other fields. So it should be of wider interest.
>>: But there is a problem. Trust my experience. If you think that that thing you showed them on the screen is user friendly and will be [indiscernible], forget it. You have to take this thing, [indiscernible] parameters. As soon as they realize that it is not a click of the mouse, they will not use it. You can bet on this. It's a wonderful thing. It works wonderfully. There will be three colleagues from the community who will make the effort to mount their libraries, to run them. Because the problem is the following. So far, these things have been done in our community by people who are hybrids like me, or computer scientists.
In this community, how many are the real astronomers? Real, for a specific question. That is always a question to ask. Can I have a -- please raise your hand. Real. George? Two.
>>: Wait, what was the question?
>>: I'm an astronomer.
>>: I am an astronomer too, but real -- but without a special bias for computing or for astroinformatics. Like, for instance, 95 percent of the people at Caltech or 95 percent of the people at Harvard, and so that's the problem. It's [indiscernible] people. I'll go and test things tonight from my hotel room, you know, to see how it works, like most other people in this room. But we are not the [typical] astronomer.
>>: Yeah, I mean, one reason is at the level of these [indiscernible] moments,
we could make it a bit easier, but it's designed for the people who are happy
to do the stuff at the command line, not designed for the astronomer's needs.
>>: Which, I do have to say, a lot of grad students would be quite comfortable at the command line and wouldn't trust GUIs and fancy front ends to actually do their real science.
>>: So it's designed for real science.
>>: It's also what I meant by user --
>>: I do agree that there is a lot of common -- it's a general -- data mining is generally something that can be used for several implementations, several different [inaudible]. Usually you use the same tools for [indiscernible]. On the other hand, I think that on top of that, you need some kind of semantic layer in which you make things -- tailor things for a specific audience. So astronomers do not know what features they use, but they know what a [indiscernible] is.
So that's my issue. So it is true that you can have the same tools on [indiscernible], but you need something that is tailored so that you have a common vocabulary.
On the other hand, we are seeing that there are tools emerging, like [indiscernible], or we heard from one of the Microsoft researchers about the [indiscernible] algorithm that will discover knowledge of [indiscernible]. These kinds of algorithms are promising in that they become black boxes. We need to supply only a very few decisions from our domain, for example, which are the features that you want to use as features and the ones that you want to use as [indiscernible]. And then the machinery will do its black magic, and will give you the answer that you will be able to incorporate to discover actual new knowledge.
But that could work for some classes of problems. But, for example, if you have to classify [indiscernible] that's shown before, earlier, that's tricky. That's something that you have to do by knowing what's inside the black box, and you have to [indiscernible] to find the actual algorithms that you need.
So for some classes of problems, I can subscribe to the idea that using data mining tools can be as easy as making a PowerPoint presentation. But for others, maybe for more interesting and more tricky problems, I cannot subscribe. I think you have to know more than that. [indiscernible] amounts of knowledge that is required of an astronomer, you cannot become both an astronomer, a computer scientist, and a software engineer in order to do your science.
But there is some knowledge that [indiscernible] in order to use these kinds of tools. And it cannot be as easy as your PowerPoint.
>>: I was going to ask, is that really true? From the standpoint of -- I keep coming back to this usability. You can, and I have, taught high school students how to use data archives. Because fundamentally, you can ask very simple questions. They don't have to know anything about where the data are stored or how they're stored or what's actually happening. But the basic idea of find data about M31, or find data about this galaxy: you go type in the galaxy name, hit return, and something pops back. Then you can start asking deeper questions.
And you can do the same thing with ADS. You can ask very deep questions with
ADS. At the same time if you want to find what papers have I published in the
last five years, that's easy. I sort of wonder if there isn't -- I very much
like Michael Kurtz's -- sure, there's always going to be these fingers going
up, but I think part of this discussion should be how do we move the overall
boundary forward so that at least very simple stuff is still being tackled by
more modern methods.
>>: Just a quick comment. I don't understand why so often we tend to be polarizing, either black or white. I think what [indiscernible] said is absolutely correct. I've been working for many years on the luminosity function, doing the standard thing. I really got into trouble because I do this [indiscernible] of counts and backgrounds. I called Eric. Eric still a member and [indiscernible] detailed the things.
The problem is that up to some point, you have standard work. You want to use standard things, which you know, which you learned in your curriculum.
>>: This is what Madison's published?
>>: Exactly. And also at some point, you realize, because no one [indiscernible], that there is a problem. So you say, here I need the expert. So what [indiscernible] is saying is absolutely right. You have problems, but you need to have a common background in the community. Because otherwise, the community will never even have the perception that that problem with which they're fighting can be solved with a data mining approach.
You understand what I mean. Because once you have that -- then there is the problem of usability and so on. So I always track it back to the original, because George said now we have [indiscernible] data miner, Weka, [indiscernible]. Now we have -- the problem is not to find the tool. MATLAB is a fantastic data mining tool. It's wonderful. It's not a question of where to find something to do data mining. The problem is to begin to form a community of people who [indiscernible] data mining every time that they recognize, in the problem they are confronted with, that it is a data mining problem.
And this is the main problem, in my opinion. Because we know that we need data mining. At least this is what has been repeated year after year for many years, because the data are too large or too complex, too articulated, too big to process with the traditional techniques. But in the meanwhile, not one step to teach, not one step to teach.
>>: I was going to say, we should --
>>: Sorry. I tend to be --
>>: I would like to ask you a question. Which is the single most serious point of failure in the uptake of data mining methodology as a whole, in your opinion? I have my opinion, but I'm biased, so I want to see --
>>: [indiscernible].
>>: I would like your answer to this question. Is it the usability of the tools that are out there, or the number and wealth of examples of successful [indiscernible] of this kind of framework applied to astronomy? Or a simple lack of culture, okay -- we're talking about people that actually don't know how to use things because they don't have the cultural basis to understand what they should do. So I'm asking you, of those three.
>>: I think it's the lack of culture, which is what Pepe just said. The lack
of knowledge, and the lack of culture, because the tools are there. Even in
neural nets, there's a history, 15, 20 years now in neural nets. It's not as
though they don't see it.
>>: What about now? What about the next ten years?
>>: I think there's going to be a tipping point. Sloan was there, and Sloan did not generally use great methods -- I'm not talking about the pipeline, I'm talking about sort of the scientists using Sloan. They used traditional methods and wrote fabulous science papers, 2,000 of them. Because of Sloan being so cheap and so successful, every agency in the world is now buying new survey telescopes.
NSF is buying LSST, okay. It's just amazing. And so everyone's going to be
forced into that direction. And there's young people who are being trained and
young people who just are more computer knowledgeable than when we were young,
and I think there's going to be a tipping point.
So I think we have to be tipping point agents, change agents in the community.
>>: Beyond the cliff.
>>: We have to be ready for them. We have to prepare these things. We have to have websites and cookbooks and examples and other things ready. And when the tipping occurs, we just have to speed it up.
>>: Get out of the way.
>>: Do like jujitsu. We have to take the momentum and direct it. So five years from now, I think it will be better than today.
>>: Jujitsu is a better place to be.
>>: Last comment. Since there are very few people, since we really need to go [indiscernible] to the community, why don't we try to find a way to meet? At least try to.
>>: So let's thank our panel.