>> Raffaele D'Abrusco: Thank you. Let's start with the definition of what I mean by workflow, okay? Sometimes it's useful to define boundaries.
So I'll just show you how we can use different kinds of methods borrowed from statistical analysis, advanced statistical analysis and machine learning to address a single problem. Now, these kinds of tools can be chained together in order to answer different questions about, in this case, how we can select blazars and how we can characterize their spectral energy distributions.
So if you have a good workflow, from my point of view, what you're going to be able to do is find new correlations, interesting things that nobody noticed before, or generalize already known correlations to a higher dimensionality. Just think about adding new colors or new observables to some relation that we already know. Or find simple correlations that for some reason have been overlooked so far. And it still happens.
And the thing I'm going to discuss in a couple of slides is an example of a simple correlation that was not known before the application of KD methods. And, of course, you want to go further and use this discovery or correlation that you've found by applying other KD methods to classify sources, to extract new knowledge, and to recover data that are sitting out there. That's something that's really close to my heart, because I have been involved in the Virtual Observatory, so I really aim to be able to reuse data that are sitting there in the archives and are available to astronomers.
As Pepe Longo mentioned in a previous talk, I was his student for a while during my Ph.D., and we were focused on two simple problems -- simple in the sense that they can be easily defined in terms of a KD application task. These are the classification of quasar candidates, the extraction of optical quasars based only on photometric data, and the determination of photometric redshifts, as Pepe already discussed.
I have summarized in this slide a couple of works that my collaborators and I have been working on over the last three years. The basic idea, at a very general level, is that we applied unsupervised clustering algorithms to determine the best distribution of sources into different groups in a usually high-dimensional feature space, in order to optimize the training of regressors, in the case of photometric redshifts, or of classifiers, where extracting quasars and distinguishing quasars from stars is concerned.
So these are two very simple tasks, but they can be tackled using really different combinations of KD methods. For example, in this example, which is a description of the weak gated experts that [indiscernible] and I have been working on, published last year, we have two layers of KD. Three layers, actually. The first one is unsupervised clustering in the color space of quasars or galaxies.
The second layer is what we call the first layer of experts: we have regressors, supervised trained experts that learn to recognize patterns in the data linking the colors of the sources to redshift.
And then we have the third and last layer, where a gating network, which is another neural network, learns how to combine the different outputs from the second-layer experts in the best way. The two most interesting things that I want to tell you are these. First, this method does not require any fine tuning to work on different kinds of sources. We applied it to galaxies and quasars, and we all know that their spectral energy distributions and the correlations between colors and redshift are quite different, because different emission mechanisms are at play.
And the second thing: we used neural networks because they were simple and we had a very small set of parameters to play with for each given expert. But every single expert could be any kind of regressor or any kind of supervised tool. We could even add template fitting methods as single experts and combine them. So in general it is extensible, and can be used to address potentially other kinds of problems involving simple classification or regression.
Well, so the question is, can we extend this kind of approach to a more general question? Can we use unsupervised clustering to try and find whether there is any kind of pattern or correlation between the set of observables that we used to cluster -- that we used to build the feature space where we apply our clustering algorithms -- and some other observables that we are not using, to understand whether some interesting signal is present in our data? This kind of question -- how does the distribution of sources inside each cluster relate to the distribution of some outside observables -- is what we tried to answer, generalizing the approach that I showed you in the slide before and coming up with the CLaSPS method.
Basically, we have three simple steps. When we have a lot of observables, usually a very large number of observables, we decide which will be used to build the space where we actually perform the clustering. Let's say that these are the colors, because most of this work was inspired by the main use case, which was trying to link the classification -- which is usually obtained from spectroscopic data or by looking at the absence or presence of emission in X-rays, radio or gamma rays -- to the actual shape of the SED of the sources, particularly without fitting the SEDs. That was the point: just looking.
So the points are distributed in some parameter space. We have this first layer where we perform the clustering, and then we apply some measure of how the members of each cluster are distributed relative to some other observables which we have not used for the clustering, using some mathematical trick, and I will show you the number, the kind of diagnostic that we define to do that. Then we try to understand whether the patterns that we find are due to some instrumental effects or just noise in our data, or whether they actually are real signals that can be used for astronomy.
The core of this method is this number here, which we call the score, for the absence of any better ideas. It's basically a measure of how the members of a cluster distribute over the bins into which we divide the distribution of labels. I'm calling features the observables that we use for the clustering, and labels the things that we attach to the single members of the clusters to understand how these clusters are formed.
Let me do a very simple example. If we have stars and quasars, just as I told you before, the labels will be just binary: zero or one, stars and quasars. I'm assigning one to quasars and zero to stars, because we're interested in quasars, but it could be the opposite, of course.
But the key point is measuring this fraction, which is the fraction of sources in a given cluster which belong to a given bin of the label distribution -- in this case, zero or one. Then we weight this over the clusters. We evaluate this number for each cluster of a clustering and for different clusterings, where the differences between clusterings can be the total number of clusters and the method that we use to achieve the clustering.
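As a concrete illustration of that bookkeeping, here is a minimal sketch of a CLaSPS-style diagnostic, assuming a deliberately simplified score (the largest label fraction per cluster, weighted by cluster size); the published CLaSPS definition differs in its details.

    import numpy as np

    def clasps_like_score(cluster_ids, labels):
        # Toy CLaSPS-style score: for each cluster, take the largest fraction of
        # members sharing one label value, then average weighted by cluster size.
        # Values near 1 mean the clustering is strongly correlated with the labels.
        cluster_ids, labels = np.asarray(cluster_ids), np.asarray(labels)
        total, score = len(cluster_ids), 0.0
        for c in np.unique(cluster_ids):
            members = labels[cluster_ids == c]
            fractions = [np.mean(members == v) for v in np.unique(labels)]
            score += (len(members) / total) * max(fractions)
        return score

    # e.g. clusters from k-means on colors; labels = 1 for gamma-ray detected, 0 otherwise
    print(clasps_like_score([0, 0, 0, 1, 1, 2], [1, 1, 1, 0, 0, 1]))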
And these are very simple two-dimensional representations of the high-dimensional structure of the clusters in terms of the scores. Then we try to find higher values of this number, which tell us that something is going on, and we try to determine whether that pattern is real or not through simulations, okay?
So our first interesting application of this method was just trying to determine, as I told you before, whether we can find new ways to characterize the high-dimensional distribution of sources in the color space, especially AGNs, using as labels the spectroscopic classification, the absence or presence of X-ray emission, radio loudness or quietness, and other things like that.
So we basically used a large AGN catalog where we added labels that told us whether a source was a blazar or a Seyfert 1 or 2, whether it had radio emission or not, and so on.
And we used as features data from a very narrow range of the electromagnetic spectrum. As you see, we used feature points that are located in this region, basically going from the mid-infrared of WISE to the [indiscernible]. Of course, this bunch of points is not representative of the features that we actually used; it's just to show you the spectral range where we worked.
And we used as labels information coming from regions of the SEDs that are far away from the region where our features are located.
The interesting thing, apart from patterns that did not lead us anywhere interesting, is that we found that a particular family of sources, the blazars, had a very strict correlation with two particular labels: the gamma-ray emission, the fact that they could be seen in the gamma-ray sky thanks to Fermi data, and the spectral subclassification which was available for blazars, into flat spectrum radio quasars and BL Lacs. I can just say that BL Lacs are the real blazars, okay, the prototypical ones, while flat spectrum radio quasars are sources that show most of the blazar traits but in some way are contaminated by other components, in terms of the SED of the sources.
And we nailed this correlation down to basically a few, a very small number of clusters that were consistent across all the clusterings that we could perform, using different methods. Let me also tell you that the number of sources in this case was quite small, and the dimension of the space was 11 or 12, so not that large.
But the interesting thing is that we could recognize there was something constant along all the clusterings that we did. So how could we understand what was going on? We basically did what any astronomer would have done in that case. After finding there was something strange -- a large value of the score -- we basically ran PCA, principal component analysis, and we determined that the correlation between these two labels and the distribution of blazars in our high-dimensional parameter space could be explained in terms of the distribution of the data in the mid-infrared WISE color space.
So we went from an 11-dimensional space to a three-dimensional space, which is something that anybody else could have done in principle, but for some reason -- my explanation is probably the strength of our biases -- it had not been done before.
So we applied an automatic way to track down correlations in the larger or smaller projections of the parameter space, and we determined which region of the SEDs of these sources was responsible for this correlation.
And the interesting thing is that if you plot, in a very simple two-dimensional projection of the color space generated by the WISE filters, the c23 and c12 colors, the distribution of blazars in our sample, you can see that they occupy a very peculiar, narrow region of parameter space.
Let me tell you that the density here is basically zero. So all the sources that we can find in a given region of the sky, looking at the WISE photometric catalog, that are located here are blazars. The contamination is between zero and two or three percent. And this region is occupied by the BL Lacs, the real prototypical blazars.
In this region we have a larger contamination, of course, where most of the quasars are usually found in the WISE color space, but we're talking about 20 to 25 percent contamination. That's a very good number if you want to produce a list of candidate blazars that you can use in follow-up.
So the first thing that we did was to model this locus, what we call the three-dimensional locus of points, in order to use it to extract new candidate blazars from the WISE catalog, or to check whether gamma-ray emission which was not associated with any known source could be explained by the presence of candidate blazars.
And we did what [indiscernible] would have done in this case. We produced a very simple model, just imposing that our boundaries contain 95 percent of the sources. And this worked quite well, because we were able to associate with candidate blazars 156 of the unidentified gamma-ray sources from the [indiscernible] catalog. Of course, we are not saying that those 156 sources are actually blazars, because we need follow-up spectroscopic observations to confirm it 100 percent. But, as some people had speculated based only on the surface density of blazars, a large fraction of these unassociated sources could be associated with actual blazars. That's a good thing, and we are planning -- basically, we asked for a [indiscernible] to confirm this.
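To make that first, pre-data-mining step concrete, here is a minimal sketch of such a boundary model, assuming simple percentile cuts on the colors of known blazars; the published locus model is more elaborate than an axis-aligned box.

    import numpy as np

    def color_box(known_blazar_colors, frac=0.95):
        # Axis-aligned boundaries containing `frac` of the known blazars.
        # known_blazar_colors: array of shape (N, n_colors), e.g. WISE c12, c23, c34.
        lo = np.percentile(known_blazar_colors, 100 * (1 - frac) / 2, axis=0)
        hi = np.percentile(known_blazar_colors, 100 * (1 + frac) / 2, axis=0)
        return lo, hi

    def select_candidates(colors, lo, hi):
        # Keep sources whose colors fall inside the boundaries in every dimension.
        return colors[np.all((colors >= lo) & (colors <= hi), axis=1)]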
And then it occurred to me that we were doing it the wrong way, because we were basically using a very simple model. We needed to think in terms of data mining. So what we did is treat the modeling of the locus as a supervised classifier that has to be described by a very small number of parameters, which can be fine tuned in terms of the quantities we are interested in: the ability to reconstruct the shape of the SED, the efficiency of the selection process, the completeness of the selection process, and the other constraints that we decided to impose in order to be extensible.
Catalogs of blazars change with time. Going from the WISE preliminary release to a newer release changes something. We want to be able to repeat the analysis in a very quick and automated way.
And basically we found that we can work very effectively in the principal component space, and we constructed a quantitative measure which we call, not surprisingly, a score again -- that's just a lack of fantasy on my side -- which tells us how compatible a source is with the locus of the known blazars. Of course, the larger the number, the larger the probability that the colors of this unidentified WISE source are those of a blazar.
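As an illustration only, here is a minimal sketch of a locus-compatibility score of this general flavor, assuming we simply turn a normalized distance from the locus in principal component space into a number between 0 and 1; the score actually used in this work is defined differently.

    import numpy as np

    def pca_locus_score(known_blazar_colors, candidate_colors):
        # Project candidates onto the principal components of the known-blazar
        # colors and map the (Mahalanobis-like) distance from the locus centroid
        # into (0, 1]; higher values mean colors more compatible with the locus.
        mean = known_blazar_colors.mean(axis=0)
        cov = np.cov(known_blazar_colors - mean, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        proj = (candidate_colors - mean) @ eigvecs
        d2 = np.sum(proj**2 / eigvals, axis=1)
        return np.exp(-0.5 * d2 / known_blazar_colors.shape[1])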
Again, this does not mean that they are all blazars, but this is a huge step forward in determining which sources should be observed to [indiscernible]. And, of course, this kind of modeling also helps us to perform spatially unconstrained searches.
So far, we have looked into regions of the sky where gamma-ray emission has been observed and where something tells us that there might be blazars. But what about just filtering the whole WISE catalog, looking for sources that satisfy our constraints in the WISE color space, and then seeing whether, in a second step, these can be associated with something that smells like a blazar in other data?
And that's exactly what we did. I'm just going to focus on this second application, which is much more interesting in terms of reusing archival data. We know that blazars are variable all over the electromagnetic spectrum. So we went through the [indiscernible], which are these little notes that astronomers who observe a transient in the sky write about what they see, and we looked for sources in the regions of the sky where these transients were observed that could be associated with our candidates.
We found that out of the 500 and something, we could associate 50 of them with candidate blazars, and for 12 of them, which is a very large number, we found spectroscopic data somewhere else that helped us confirm their nature. Other people had observed samples of [indiscernible] and basically found sources whose spectra did not fit into the usual scheme of Seyferts, and they left it there. So we went through the archives and found that these spectra are consistent with our picture, which tells us that these sources are blazars.
So why this kind of experiment? Because we could apply it to the huge archives that are available out there. I'm thinking about optical variability. I'm thinking about the transients that have been observed in the same gamma-ray sky but have not been associated with any source in the catalogs just because they last a few hours or a few days and didn't make the cut in terms of signal-to-noise ratio.
So could we enlarge the number of known blazars? I'm positive that we can, and there is a never-ending number of applications. But the interesting thing is that we could also challenge the method with a much more complex dataset.
A colleague of mine with whom I'm collaborating has been awarded a very large amount of time on Chandra to observe a larger region containing the COSMOS field, the Chandra COSMOS region, so we will have a very large, rich dataset to apply this kind of method to. We are talking about [indiscernible] information and classifications coming from across the [indiscernible] spectrum, and this is quite an almost perfect dataset to test the [indiscernible] of this method.
Of course, we need to improve the method, because now we are dealing with a real dataset, in the sense that the previous dataset was picked: we chose only the sources for which we had all measurements and all [indiscernible], for example. I'm not going to lie to you: our dataset was a test dataset that led to an interesting discovery, but we hand-picked our sources.
In this case, we're going to handle something which has not-a-number values, missing measurements, upper limits and the kinds of things that really make real data so interesting and so difficult to handle from the point of view of data management techniques.
So we need to try to address these problems without getting rid of most of the information contained in the dataset. And we found, thanks to the advice of a very good referee of our paper, that people out there -- statisticians and computer scientists -- have already developed methods to handle this kind of situation. It's called consensus clustering, and basically it helps us to combine clusterings obtained from the same dataset under different views, different sets of features, different sub-projections of the same general feature space -- this is called feature-distributed clustering. It can also help us to combine clusterings from slightly different small samples of sources in the same feature space. So now I'm working on these kinds of things from the methodology point of view.
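For orientation, here is a minimal sketch of one common flavor of consensus clustering, evidence accumulation through a co-association matrix; it is only an illustration, not necessarily the specific algorithm referred to in the talk.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def consensus_clustering(partitions, n_clusters):
        # Combine several clusterings of the same N objects (e.g. obtained from
        # different feature subsets).  The co-association matrix records the
        # fraction of partitions in which two objects share a cluster; we then
        # cut a hierarchical tree built on 1 - co-association.
        partitions = [np.asarray(p) for p in partitions]
        n = len(partitions[0])
        coassoc = np.zeros((n, n))
        for p in partitions:
            coassoc += (p[:, None] == p[None, :])
        coassoc /= len(partitions)
        condensed = (1.0 - coassoc)[np.triu_indices(n, k=1)]
        tree = linkage(condensed, method="average")
        return fcluster(tree, t=n_clusters, criterion="maxclust")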
Last slide. So basically, from my point of view, this is one interesting example of the kind of discovery -- simple for now, but I'm aiming at complex ones -- that sequences of different kinds of approaches, all borrowed from KD, can bring to astronomy. And I'm particularly proud that we were able to reuse, and are going to keep reusing, archival data, because it really justifies all the effort that people out there have put in during their whole careers, trying to create the protocols, infrastructure and services that allow astronomers to retrieve these data.
And, of course, the next big thing in KD applied to astronomy is variability, because so many variability-focused observations are coming up -- I'm thinking about optical, high energy and other kinds of things, of course. My opinion is that we really need to focus now on these kinds of methods, because in the very near future a certain kind of astronomy [indiscernible] will not be possible without this kind of approach, and if we don't use it, we're going to lose a lot of interesting information and potentially discoveries.
So, my last slide. I want to thank the agencies that paid my salary during the last two years, which are the center for -- oh, jeez -- the CfA, the Center for Astrophysics, and the VAO, which paid part of my salary, and I want to acknowledge the very useful help from my closest collaborators. If you want to read about the things I've been [indiscernible] and understand them, you can find here some of the published papers about all the different threads that I touched on in this talk. Thank you very much.
>>: Could the next speaker come up while we have a couple questions?
>>: Let's just take variability. If it was always a real number, then you
could just put in a real number. The amplitude and magnitude and
[indiscernible] or something. But the fact is that the first question is, is
it variable at all. So first you need a binary yes, no. And then if it is,
then its amplitude can be written down. But sometimes you have better
observations, more extensive, more sensitive than others. So the absence of
variability, it's not so simple.
How do you deal with this heterogeneity, from binary numbers to real variables, and with the issue that quality differs from one [indiscernible] to another?
>> Raffaele D'Abrusco: Okay. My answer is very simple. So far, we have not
tackled this very difficult issue. So far, what we did is just go to the position in the sky where our algorithm told us that there could be a candidate blazar, and look for any trace of people having observed variability. So we are basically using some kind of user-generated information. Of course, there will be a large number of sources which are actually blazars but that have not been observed at the times when people were pointing the observatories, of course.
How to assess and incorporate quantitative information about variability is something that is very hard, because you have to take into account how you estimate whether a source is variable or not. I have some ideas: we could use some of the estimators, the diagnostics, that have been developed in the community, but you also have to handle the fact that blazars vary at different times and in different bands.
For example, one consideration is that the region of the SEDs which WISE has observed probably varies on very different time scales from the high energy and the radio. So we're actually using information which is an average of the colors of the sources, integrated over something like eight to 40 exposures, and this helps us to get rid of some of the noise introduced by variability -- I'm calling it noise just from the point of view of KD. So it's going to be much more difficult, and we're thinking about that and we're working on that, basically.
>>: We need to move on. Sorry. Let's thank Raffaele again. So our next
speaker is Ciro Donalek, continuing the discussion about transients.
>> Ciro Donalek: Hi again. I don't have any other [indiscernible] so I will
just stick with the automated classification of transient events and variable
sources. And these are the other people in the group, most of whom are here: George, Ashish, Matthew, Pepe Longo, and then Andrew Drake and Matt Yang are working on this as well.
So this is a short summary. I will start with a very short introduction to time domain astronomy for non-astronomers [indiscernible], and then proceed to the problems it poses for classification and how we are attacking them, like the [indiscernible] and the work done on feature selection.
So that's a slide for Jim Gray that reminds me and you that I'm a computer
scientist and not an astronomer. And this data [indiscernible].
So briefly, why do we need data mining in science? These are some examples that people have talked about yesterday and today: we now have better and faster technology that is producing more and more data. Yesterday we saw examples of sequencing -- comparing billions of GCAT sequences in genomes -- or the terabytes of daily data that are about to come from synoptic sky surveys, or climate data.
And all this data is changing in nature. What I mean is that basically we refine our data, we get new data, it's heterogeneous. But all of these disciplines have exactly the same goal, which is to extract knowledge. That's why data mining is really important, and you want to do it rapidly and as efficiently as possible.
In all these disciplines we end up using some data mining tools, so we'll be doing something related to classification, clustering, regression, path analysis and visualization. This will be in common. But in my opinion, the main reason why we need much better data mining tools is to not end up like this. That's probably what [indiscernible] storage machines in the next couple of years will look like.
So this is a short [indiscernible] to time domain astronomy for non-astronomers, because I see there are many in the audience. Basically, it's becoming one of the most exciting new research frontiers in astronomy. What we have now is telescopes that look at a certain part of the sky over and over, so basically each object, for the scientist, is like a time series.
And these are some of the classification challenges that this poses. Like real-time artifact removal: for example, in this case, these are three false positives, and we need an automatic way to remove them from our data. We are [indiscernible] removing these artifacts using [indiscernible] neural networks. And then, of course, there is the real-time classification, because sometimes you want to be able to classify within just a few minutes of the discovery so you can activate some robotic follow-up.
And then there is the next-day transient classification, which is mostly what we're talking about today. And, of course, decision making, because out of the hundreds of thousands of transients that will soon be coming out of [indiscernible], you want to choose the best ones, the ones [indiscernible], because data is cheap, but follow-up is still very expensive because you need to [indiscernible].
So most systems today rely on a delayed human judgment. And that's not -- that will not scale with the next generation of surveys.
Probably the best way to illustrate what's going on is to just look at the pictures. On the bottom there is the baseline sky, let's say the sky as we know it today. On the upper part are the new observations. And as you can see in the box, the object has changed: it's become brighter. Much brighter, much bigger.
So just looking at the images, at the points, you cannot tell much, because they are basically all the same. While with follow-up, doing the follow-up, you can see that there are actually three very different kinds of phenomena here: a flare star, a dwarf nova and a blazar. So we need much more information than just the images.
And from a computer science perspective, it is sometimes [indiscernible] to dealing with the stock market, because there are many classes, many classes/shares, and people interested in different classes. Some people want supernovae, some [indiscernible] are just interested in blazars. So you can basically have an on-demand classifier, a classifier optimized for a given class. Of course, we want high completeness to maximize the gain and low contamination to minimize the losses.
So in general, what we are dealing with is massive multiparametric datasets that are peta-scale ready. The data is very sparse and heterogeneous, as I'll show you in a moment, especially for Catalina. And we are dealing with a high number of features and classes.
In the diagram there is just a classification, done by -- I don't remember -- of the classes of transient and variable sources. And the number of features is now more than a hundred, 120. So for the classification we want to be real-time and reliable, with high completeness and low contamination. Sometimes with the minimum number of points: let's say it's just a new discovery, and with just three points you want to understand something. Even if you only use it to rule out some of the classes, that's okay, but still you want to work with very few points. And then, of course, we have to learn from past experience to refine our classifiers, and be as automated as possible.
And then we want to include external knowledge, because sometimes, especially in astronomy, the telescope [indiscernible], the conditions in which the image was taken, and all this information matter.
So this is our dataset. Basically, we have both parameters and light curves. This is a distinction that Ashish actually made. We can call discovery parameters just the magnitude and time, so just when the actual object has been discovered.
And then there are all the contextual parameters that we can grab from archival
information that can [indiscernible], like distance from the nearest star and
galaxy, distance to the nearest radio source and so on.
Then, if available, there are the follow-up colors, if there is follow-up. And then, because we have a light curve, there is the light curve characterization that I'll show in a moment. Of course, we also have the class information for some of them. And the data in this case is heterogeneous and unbalanced, because for certain classes we have thousands and thousands of objects while for others we have maybe ten objects that we can actually use to train the classifiers. And it's very sparse, and there is a lot of missing data because, of course, we don't have follow-up for everything, some archival information may be missing, and so on.
In what I'll be showing, I'll be using the features extracted from each light curve using the Caltech time series characterization service. Basically, for each light curve we extract around 60 to 70 parameters, periodic and non-periodic features. And this is an example: this is a [indiscernible] with five outbursts, and this is a list of some, not all, of the parameters that we can extract from this light curve and on which we are training our models.
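As a rough illustration of what such a characterization step produces, here is a minimal sketch computing a few periodic and non-periodic features; the actual Caltech service computes far more, and the Lomb-Scargle call here assumes astropy is available.

    import numpy as np
    from astropy.timeseries import LombScargle
    from scipy import stats

    def light_curve_features(t, mag):
        # A handful of illustrative features: amplitude, scatter, skewness,
        # fraction of points beyond one sigma, and the best Lomb-Scargle period.
        freq, power = LombScargle(t, mag).autopower()
        return {
            "amplitude": 0.5 * (mag.max() - mag.min()),
            "std": mag.std(),
            "skew": float(stats.skew(mag)),
            "beyond1std": float(np.mean(np.abs(mag - mag.mean()) > mag.std())),
            "best_period": 1.0 / freq[np.argmax(power)],
        }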
Since I'm talking about data coming from Catalina, just a slide on what Catalina is; the address is crts.caltech.edu. It's a survey that searches three quarters of the sky for highly varying astronomical sources and to find transients. All data is fully processed within minutes of observation and, a very important thing, especially for computer scientists and people working on this, all discoveries are made public instantly to enable follow-up.
So this is a list of the problems that I was given: the overall classification, trying to classify all the objects, or very specific problems, like a systematic search for CVs, or an RR Lyrae versus eclipsing binary dataset. And that's interesting, because eclipsing binaries are the main contaminant when using RR Lyrae as tracers of galactic structure. So these are specific problems for specific people with specific classes.
And the way we are doing this is using binary classifiers. That's because, as I will show, different types of classifiers perform better for some classes than for others. And we can build a tree like that, starting with some [indiscernible] motivated features: supernova versus not supernova, and then we can refine the supernovae, or [indiscernible] versus periodic, and so on.
The classification schema we use now: basically, we start with the input data that is [indiscernible], so light curves, features, archival data. Then each of the [indiscernible] gets different input data, because some can handle missing data and some cannot, some can work with light curves and some cannot. The [indiscernible] used are ensembles of KNN, [indiscernible] and some decision trees, Bayesian networks, [indiscernible] neural networks and supervised SOMs. And, of course, from this framework what we get is one or more classifiers that can be escalated. The [indiscernible] is how to compute the combiner, and for now I'm just using the mode, the weighted mode. And then we'd like to introduce the external knowledge into [indiscernible].
And, of course, each model has its own base of knowledge, feature selection, et cetera. So the experimental framework is this one: we start with the base of knowledge for each model and we do some processing, as shown, feature selection. Then we build the ensemble, [indiscernible] of classifiers or boosting. To test the quality of the ensemble we use, let's say, stratified ten-fold cross validation, set the appropriate number of ensemble members, and then compute completeness and contamination on an independent test set [indiscernible].
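As a sketch of that evaluation loop -- under the assumption that "completeness" is the recall on the target class and "contamination" is the fraction of false positives among objects assigned to it -- something like the following, using scikit-learn:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold

    def completeness_contamination(X, y, n_splits=10):
        # Stratified k-fold estimate of completeness and contamination
        # for the positive class (label 1).
        comp, cont = [], []
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train, test in skf.split(X, y):
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            pred = clf.fit(X[train], y[train]).predict(X[test])
            true_pos = np.sum((pred == 1) & (y[test] == 1))
            comp.append(true_pos / max(np.sum(y[test] == 1), 1))
            cont.append(1.0 - true_pos / max(np.sum(pred == 1), 1))
        return float(np.mean(comp)), float(np.mean(cont))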
And this is [indiscernible] still don't know how to do that.
So the first step has been to study which features [indiscernible] for which model. We start with over 100 features, so we need to address the curse of dimensionality and reduce the number of features. Also because some features may be misleading: in many cases, using all the features produces worse results than using just a subset. And we use feature selection [indiscernible] PCA, because it's often preferable, when the meaning of the features matters, to see which actual features are involved.
Also because, eventually, we will be extracting these features from hundreds of millions of [indiscernible], so if we know which features are better we can extract only those [indiscernible]. And then, of course, we analyze these sets with the domain scientists [indiscernible].
So, for estimating feature importance, just a few methods. Sequential feature selection: basically, on a given model, [indiscernible] estimator [indiscernible] and Kullback distances on the test. And forward and backward selection, [indiscernible] maps. For example, using the [indiscernible] dataset and a ranking algorithm, I got the top features and chose the [indiscernible]. The best three were period, median observation [indiscernible]. So when I asked [indiscernible] does this make sense to you, he said yes, because this shows the relationship between the [indiscernible], and [indiscernible] is important. [indiscernible] for doing something like a cross correlation: when the maps are equal, you can assume that the parameters are correlated.
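For concreteness, a minimal sketch of sequential forward selection of the kind just described, assuming cross-validated accuracy as the figure of merit (it is greedy, so only an approximation to the best subset):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def sequential_forward_selection(X, y, n_keep=3):
        # Greedily add the feature whose inclusion most improves
        # 5-fold cross-validated accuracy of a small decision tree.
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < n_keep:
            scored = [(cross_val_score(DecisionTreeClassifier(random_state=0),
                                       X[:, selected + [j]], y, cv=5).mean(), j)
                      for j in remaining]
            best_score, best_j = max(scored)
            selected.append(best_j)
            remaining.remove(best_j)
        return selected  # indices of the chosen features, in order of selection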
So let's see some results. This is like [indiscernible]. These are the results from the RR Lyrae versus eclipsing binaries dataset. This is a kind of [indiscernible] classification; it's the dataset most used for benchmarking in data mining. So out of the 60 features, I ran some tests applying the feature selection algorithms, and these are the results. Basically, all of the methods -- KNN, the neural networks and the decision trees -- are [indiscernible] on this dataset. Of course, the best ones are still the decision trees and ensembles of decision trees and neural networks.
>>: How many data points go into this parametric database?
>> Ciro Donalek: It's a very few.
>>: And how few?
>> Ciro Donalek: For this, 463.
>>: 463 photometric measurements on one object or is that 463 [indiscernible].
>>: It's 463 [indiscernible].
>>: How many photometric observations are in a typical --
>>: 250.
>>: 250 observations?
>> Ciro Donalek: Yeah, it was in the other slide.
>>: I'm sorry.
>>: I can tell you more about it later.
>> Ciro Donalek: Okay.
>>: And are they bright and high signal, or are they noisy?
>> Ciro Donalek: Yeah. They're noisy. You can see [indiscernible].
>>: Thank you.
>> Ciro Donalek: I can switch to the [indiscernible] that's the most difficult
to do. Essentially, okay, this is entering the formula for the systematic
search for [indiscernible]. And basically, what it is [indiscernible] very
difficult to see. You can see [indiscernible].
These are the results using [indiscernible] three and the self-organizing maps. Now, what is important to notice is that the KNN, which were performing really, really well on the RR Lyrae dataset, here are basically classifying everything as a CV. So that's why, in the overall classifier scheme, we should [indiscernible] the output: because in this case we know that this model is not good for blazars or CVs, so we should be able to include this external knowledge in the framework.
And I have also started trying to combine, naturally combining [indiscernible]. So now I'm mixing [indiscernible], networks and the other inputs, and it's slightly better than just using the [indiscernible].
Yesterday, George asked about the difference between dealing with strings and dealing with numbers. For some models you can encode this just by changing the distances. So this is from just yesterday: it's KNN using many different distances and, for example, you can see that the [indiscernible] distance is, of course, what [indiscernible], because it's actually made for comparing strings to different strings.
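A minimal sketch of that idea, assuming scikit-learn's KNeighborsClassifier with a user-supplied metric and a simple hand-rolled edit (Levenshtein) distance over integer-encoded, padded strings; the distance actually shown in the talk may differ.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def edit_distance(a, b):
        # Levenshtein distance between two padded integer-encoded strings
        # (padding value -1 is dropped before comparison).
        a = [int(x) for x in a if x >= 0]
        b = [int(x) for x in b if x >= 0]
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # strings encoded as fixed-length integer arrays so sklearn accepts them
    X = np.array([[1, 2, 3, -1], [1, 2, 4, -1], [7, 8, 9, 9]])
    y = np.array([0, 0, 1])
    knn = KNeighborsClassifier(n_neighbors=1, metric=edit_distance).fit(X, y)
    print(knn.predict(np.array([[1, 2, 3, 3]])))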
And then this is the future work. First, extreme data mining. Then a better way to combine the outputs from the single classifiers. Getting more data, for sure, and refining existing models and strategies, and investigating and adding more models. More work on the feature selection, but also the work on [indiscernible] classification using CLaSPS. And if there are any suggestions, I'd appreciate them.
>>: So while the next speaker comes up, do we have any questions? David?
>>: So you alluded to this, but you didn't say anything further about it. That is, in most real classification tasks, especially if you imagine a future where LSST [indiscernible] follow-up tasks, we have utility considerations, decision utility considerations. Sometimes follow-up is very expensive. Sometimes there's tons of false positives, and so on.
Many of these classification methods you're combining have no kind of way to incorporate utility information. But some of them do, because some of them produce sort of quasi-probabilistic or actual probabilistic output, and then you could multiply by utilities and maximize your cash flow.
And there's a kind of hidden [indiscernible]. I don't really have that much --
>> Ciro Donalek: No, I know what you mean. And naturally, I skipped all the measurement with [indiscernible]. Ashish will be talking about it. So we do want to mention that [indiscernible] is one of the problems.
>>: [indiscernible].
>>: It's a hard problem. But really, we're facing situations where we're going to have to be [indiscernible].
>> Ciro Donalek: Yeah, good point.
>>: All right, a real quick one?
>>: So I wasn't clear. When you tried to assign blazars or CVs, are you just
using the time series data, or are you accessing WISE and other surveys?
>> Ciro Donalek: No, for these results, I'm using just the features extracted
from the light curve. But in the dataset we have all the features. Again, in the dataset we use other things.
>>: Okay.
>>: Well, let's thank the speaker again. So next up, we have Jeff Scargle talking about time series explorer.
So next up, we have Jeff Scargle
>> Jeff Scargle: Okay. Well, thanks for the opportunity to speak here.
Basically, this is a very simple presentation of some ideas for input to the
classification scheme or just in general things that can be done with time
series data.
Acknowledgment. Brad Jackson is the key person who, mathematician who
essentially discovered the key Bayesian block algorithm, the key to it and so
on.
And in a nutshell, here's what I'm going to talk about. Basically, there are
many algorithms, standard, old ones and some new ones that can be applied to
astronomical time series data in sort of a general context of arbitrary
sampling and uneven sampling and other data problems that all real world data
has.
These can be easily computed in some sense rapidly and displayed for large
databases. And that's a bit glib. When the size of the database is huge, that
may not be as simple.
So this then leads to more input for automatic classification as one of the
goals of this kind of thing.
And I've tried to sort of organize the different categories of algorithms.
Maybe there are some other ways of slicing this. But local structure
representations in the time domain, more global approaches in the frequency or
time scale domain in the standard power spectrum correlation functions, wavelet
power spectrum. And I think an important thing that has been used quite a bit
in astronomy, but perhaps not enough, is time frequency or time scale
distributions. Kind of a hybrid between the other two.
I'll show an example not from the main dataset I'm using here, which is the
same Catalina sky survey data that we were talking about in this session quite
a bit.
Basically, in the time domain there are lots of different approaches to parametric and symbolic representations. I'm going to stress this approach, which has been known under the name of Bayesian blocks, namely generating simple piecewise constant, linear or exponential models of the data.
And here are some of the problems that I alluded to a bit earlier that can
stand in the way of someone faced with raw, real world data in generating these
things. Uneven sampling, gaps in the data, which is another kind of uneven
sampling, observational errors or perhaps errors that vary from point to point.
Variation in the exposure so that there's some conversion between the recorded
signal and the true brightness that can vary in some hopefully known way. And
often, you're faced with a variety of data modes.
It turns out that essentially all of these problems can be dealt with in this
time domain approach with Bayesian blocks, as I think you'll see as we go on.
Here's an example of Bayesian modeling or segmented block approach to some real
data. The horizontal axis is time in seconds. This is gamma ray flux from a
gamma ray burst. And the yellow histogram is simply a way of representing the
raw data.
The raw data are time tag events for individual photons, and the only way to
make sense graphically is to somehow bin it and that's what the yellow
histogram is with arbitrary binning.
Of course, when you pick an arbitrary bin, you never know ahead of time, are
you making the bins too big or too small. Are you smearing out real features
or are you allowing statistical fluctuations to show through.
And in some sense, the Bayesian block algorithm and the representation shown as
a blue histogram is addressing that issue. It's basically an adaptive
histogram, in a way, of the time tags for the individual photons, where the
data determines the locations of the bins and the widths of the bins. It's
completely data-adaptive.
And I just recently posted a long overdue paper with the details of this.
There's an old paper that, unfortunately, has been sitting around for, I guess,
over a decade now that is really -- I mean, I had the basic ideas there, but
the algorithm in it is obsolete. This is the new algorithm based on dynamic
programming and this is really the magic part of the algorithm that Brad
Jackson and the group at San Jose State came up with.
I think this concept of dynamic programming should be more well known in this
field. I'll show a little more detail about it later. But basically, it
solves the problem of finding the optimal partition of this dataset into a
finite set of blocks like this, and that's a huge exponentially large problem.
But this algorithm allows an exact solution in N squared complexity or N
squared time.
And by the way, this paper is posted in the Astrophysics Source Code Library -- there's a poster about that new repository. And in addition, on the arXiv and hopefully in the final publication, all of the code necessary to reproduce the figures in the paper will be included: all the data sets that went into it and all the code.
And this is implementing a concept that, again, I think astronomers should know
more about and use. Namely, reproducible research. This is an excellent paper
talking about the idea, basically showing that scientists have been really
fairly cheesy in the past at publishing only the tip of the iceberg or an
advertisement for results. You could almost never actually reproduce the
details in most papers.
So this is enforcing discipline of publishing everything so that anybody could
reproduce the results in detail from scratch, so to speak.
Here's -- I can't resist showing this, because this is applying the Bayesian
blocks algorithm to my favorite object, the crab nebula, which for years was
known as a standard candle assumed to be constant in flux at all wavelengths
and, of course, now we know that in gamma rays, it's variable with flares in it
and all kinds of interesting things.
The red blocks show the Bayesian blocks analysis of this set of 1,500 data
points, and that number of data points has ten to the 468 possible partitions.
And this algorithm finds the optimum from that number explicitly. I'm sorry.
>>: Is that a map reduce step?
>> Jeff Scargle: No. That's a good point. This is sort of orthogonal to the
whole idea of map reduce. You don't have to necessarily go down a tree.
Dynamic programming is another way to cut down the tree, so to speak.
And it's so simple, it's almost embarrassing to show the MATLAB code to do it. Of course, there are a few hidden things. But this is the cost function -- the fitness function that you use is in this function. Otherwise, the idea behind the algorithm is completely contained in this simple MATLAB code.
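For readers following along without the paper, here is a minimal Python transcription of that dynamic-programming idea, assuming evenly reported measurements with roughly equal errors and a simple Gaussian block fitness with a constant penalty per block; the published algorithm adds a calibrated prior and fitness functions for event and binned data.

    import numpy as np

    def bayesian_blocks_dp(values, penalty=4.0):
        # Optimal piecewise-constant segmentation by dynamic programming.
        # Returns the indices at which blocks (change points) begin.
        n = len(values)
        best = np.zeros(n)             # best total fitness of the first k+1 points
        last = np.zeros(n, dtype=int)  # start index of the final block in that optimum
        csum = np.cumsum(values)
        csum2 = np.cumsum(values**2)
        for k in range(n):
            # fitness of a single block spanning points j..k, for every j <= k:
            # minus the within-block sum of squares, minus a per-block penalty
            s = csum[k] - np.concatenate(([0.0], csum[:k]))
            s2 = csum2[k] - np.concatenate(([0.0], csum2[:k]))
            width = k - np.arange(k + 1) + 1
            fit = -(s2 - s**2 / width) - penalty
            total = fit + np.concatenate(([0.0], best[:k]))
            last[k] = int(np.argmax(total))
            best[k] = total[last[k]]
        # backtrack from the end to recover the change points
        change_points, k = [], n
        while k > 0:
            k = last[k - 1]
            change_points.append(k)
        return sorted(change_points)

The N-squared cost comes from re-scoring every possible start of the final block at each step, which is the dynamic-programming step referred to above.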
And if I had a few more minutes, I could actually prove to you how the
algorithm works and exactly what the dynamic programming steps do. And this is
sort of the cartoon that I would use for that. But in view of time, I'll skip
that. It's in the paper that I mentioned earlier.
People always ask, well, what are the errors in the block representation. And
I always answer, well, what do you mean? I'm delivering a whole function.
What do you mean by the errors? And there are various ways of looking at it.
This is sort of a bootstrap approach to that question, where the blocks are the
actual block solution. The solid line is kind of a bootstrap average, model
averaging with bootstrap, and then the spread is given by the thinner lines.
Another thing you might be interested in is what are the errors in the
locations of the change points, the points in time that mark the edges of the
block, and here's a simple way of looking at the posterior probability for each
change point where you fix all but one change point and see how the posterior
varies.
So it's a little bit of a cheat, but not too bad.
And another thing that arises sometimes is you have data on a circle, and sort
of we all -- we pondered that for quite a long time. And if you think about
it, the algorithm has to start at an origin of time. It works recursively
starting from one data point and adding successive data points. So if you
don't have an origin, which is the case on a circle, in a way the algorithm is
fundamentally thwarted. And there's a way around that, and this is just a toy
example of a histogram on a circle, angular data or something like that.
And I don't have time to go into the details of this, but there's a very nice
theoretical paper by a Stanford group, Dave Donoho and Arias-Castro on
theoretical analysis of sort of the optimal detection problem in general
deriving a theoretical kind of limit. It's an asymptotic limit for detection
of signals and noisy data, and they derive an amplitude, A1 and A2 here in two
different cases. A theoretical amplitude. And this is just some simulated
results showing that this is the error rate in simulated data as a function of
the amplitude of the signal relative to this theoretical result.
And the fact that the error rate drops dramatically at about unity in this ratio shows that this algorithm is essentially approaching the theoretical detection limit.
Okay. And another thing you can do, of course, is look at power spectra and
correlation functions. And here's a -- this is a great paper that's fairly
old, but has been well used in astronomy for calculating correlation functions
from arbitrarily spaced data. Very simple idea, which this line of algebra
encapsulates. You basically make some bins in the lag and calculate -- in this case a cross correlation function of X and Y -- by summing the products of X and Y values at pairs of points that are separated by that particular lag.
And it's very useful, because you can start from this and calculate the power
spectrum as the Fourier transform of this correlation function.
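In that spirit, here is a minimal sketch of such a binned-lag estimator for unevenly sampled series; the normalization is simplified relative to the published discrete correlation function.

    import numpy as np

    def binned_lag_correlation(tx, x, ty, y, lag_bins):
        # Cross correlation for unevenly sampled series: average the product of
        # mean-subtracted, variance-normalized values over every pair of points
        # whose time separation falls in each lag bin.
        xs = (x - x.mean()) / x.std()
        ys = (y - y.mean()) / y.std()
        dt = ty[None, :] - tx[:, None]      # all pairwise lags
        prod = xs[:, None] * ys[None, :]    # all pairwise products
        out = np.full(len(lag_bins) - 1, np.nan)
        for i in range(len(lag_bins) - 1):
            mask = (dt >= lag_bins[i]) & (dt < lag_bins[i + 1])
            if mask.any():
                out[i] = prod[mask].mean()
        return out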
And so this is sort of extending these ideas, you kind of can generate a
universal time series analysis machine or automatic processing thing where
you -- any data mode, and I, of course, haven't gone into the details, but all
of what I've said before and more can be done for any of the standard data
modes with any sampling in time to generate correlation functions, everything,
structure functions. Essentially, you name it. There may be some things that
are of interest that aren't on this list.
And all of this can be done, if you have two time series, you can do the
cross-versions of all of these things. And here's a sort of a matrix of
showing the same thing that essentially, you can do all these things and these
are kind of the physical interpretations of what you learn.
Are there any questions about this? I'm going to show some specific results
now. This is something that I hadn't thought of until I started working with
George and the group down at Caltech. This is from one of their papers. It's a -- I don't know what to call it -- a delta-m, delta-t plot. Basically, for every
pair of observations, you can calculate the magnitude change and the time
interval between the two data points and just do a scatter plot. In this case,
it's a density histogram representing that.
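A minimal sketch of building such a delta-m, delta-t density, assuming we simply histogram every pairwise magnitude change against the corresponding time separation:

    import numpy as np

    def dm_dt_histogram(t, mag, dt_bins, dm_bins):
        # 2-D density of magnitude change versus time separation over all pairs.
        i, j = np.triu_indices(len(t), k=1)   # every pair of observations once
        dt = np.abs(t[j] - t[i])
        dm = mag[j] - mag[i]
        hist, _, _ = np.histogram2d(dt, dm, bins=[dt_bins, dm_bins])
        return hist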
>>: We ended up calling these probabilistic structure functions. But there's got to be a better name.
>> Jeff Scargle: Maybe we should have a contest. If you couldn't hear, probabilistic structure functions, I think. Frankly, I can't think of a worse name for it, but maybe there is one. Because I hate structure functions is the main --
>>: You said it's a scatter. Why is there structure in it?
>> Jeff Scargle: George, this is your plot.
>>: Why is there structure in the --
>>: It's a seasonal [inaudible].
>> Jeff Scargle: You may have the same question for things I'm going to show
now. Here is just sort of implementing the notion of a universal machine that
you put in the time series; here's just one object. This is from a sample of several hundred Catalina Sky Survey AGNs and blazars, I think, is what I selected.
This is a case where here the raw data are shown as the points with error bars,
and the Bayesian block representation is shown as a red line. I don't know why
it doesn't -- actually, this is a nonvariable object according to the
algorithm. I don't know why it doesn't, the red part doesn't go all the way.
So then automatically, this is just a histogram of the intervals, the DT
intervals. Almost always learn something. This seems trivial, this is just
the distribution of the intervals between the data points. But almost every
time you look at that with data that somebody gives you, you learn something
interesting about the sampling or the experiment. And this is just a histogram
of the changes in the intensity. I've converted magnitudes to fluxes. But
otherwise, this is like DM.
And this is that DM/DT plot for this particular dataset. And superimposed on
it, on this color scale is a contour map of the same thing done with Bayesian
block as the input data. This is, DX and DT are calculated not for the raw
data, but for the blocked data. There's nothing here, because the Bayesian
block has no variability so that's a null thing there. But later plots will
show that.
Here's the auto correlation function. Almost a delta function indicating white
noise, which is kind of what's going on here. The power spectrum and the log
power spectrum both consistent with essentially white noise because this is a
nonvariable object.
I haven't systematically picked out the best case or the worst case. I've just kind of randomly picked a few cases to show.
Now, here's a case where there's real variability, sort of an increasing over
the time of the observations and the block representation follows it in a step
function way. There's nothing other than a step function approximation to the
data, maybe showing some shorter time scale structure here.
And here, now you see that there is enough block information to do this DX DT
plot as a contour and you see it's somewhat different from the raw data. My
claim is that the block representation has the noise kind of eliminated so it's
more the truth of what's going on in this plot. And here you see kind of
expected auto correlation. There's some structure perhaps. I don't really
think there's anything periodic in this data, but there's some fluctuations.
And here's just another case where the Bayesian block DM/DT plot more or less agrees with the one from the raw data, and I think that's just because the signal to noise is pretty high, so the blocks pretty faithfully follow the data.
So, of course, this is kind of in a way crazy, because I've set up a
visualization thing for one object, and the whole problem here is there's too
many objects to be able to do this for individual objects. So what you really
want to do is to extract information of the kind that Ciro and other people
have talked about and to input to classification schemes.
And so I've done that. I've extracted a few things from the, for example, the
slope of the power spectrum and some other quantities and so you can play games
with looking -- with studying these now for the ensemble of I think there are
several hundred objects. Only the variable ones are kind of included here, the
ones that are completely nonvariable I've eliminated. And you can play around
with different things that you want to look at.
This is calculating the kind of local time derivative, just the local dX/dT between adjacent points in time, for the values where X is decreasing; here's the same quantity where X is increasing. And so it's kind of obvious you should have a negative correlation here.
This is almost a trivial plot. Here, this is the variance in X, in the
increasing part of the light curve. Sorry, decreasing. And this is the
variance in the increasing part. And I think the asymmetry in this scatter
plot is saying something about the underlying physical process.
And so I'm done. Here's -- well, just a couple of quick things. This is a
histogram of the slopes of the power law power spectrum, the spectral index, and I think -- okay. One last thing. This is completely different data. This is some solar
data, chromospheric activity over the last three solar cycles. You see the
solar cycle, but there's a lot of detail. If you do a time frequency analysis,
almost magically, these peaks show up in the time frequency distribution -- frequency this way in cycles per day, and this is the date. This is differential rotation, the fact that the active regions on the sun drift in latitude just like sunspots do, so you see differential rotation.
If you look at an overall power spectrum of that data, you never see
this. But the time frequency analysis is very sensitive to spectral features
that are varying with time. So I think this is a really neat thing to have.
Well, I think I'll end there.
>>: Thanks very much. Can our next speaker come up while we take a few questions?
>>: Your Bayesian blocks were originally written for event data, like gamma rays. Can they be used for, let's say, optical photometric data with [indiscernible]?
>> Jeff Scargle: Yes. The newer paper has that explicitly. Has all the
details for that. The bin counts, events and the kind of data you were talking
about.
>>: Real numbers?
>> Jeff Scargle: Yeah, right.
>>: How did you do the fitness function for the programming? Is there a natural way of choosing the fitness function?
>> Jeff Scargle: It's kind of up to you. We worked out several. Basically,
you want something that expresses how well the data are -- how constant the
data are over an interval. So just the simple variance of the data is not too
bad. But better is a maximum likelihood or posterior probability for the
Poisson model in the case of event data. So there's some choice there. What
you have to have is additivity. The fitness function for a whole interval has
to be the sum of the fitnesses of individual blocks.
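For concreteness, one such additive fitness from the Bayesian blocks literature is the maximum-likelihood Poisson block fitness, N(ln N - ln T) for a block containing N events over a length T, with constant terms dropped; a minimal sketch, offered as an illustration rather than the only valid choice:

    import numpy as np

    def poisson_block_fitness(n_events, block_length):
        # Log maximum likelihood of a constant-rate Poisson block, N*(ln N - ln T),
        # with the convention that an empty block contributes zero.
        n = np.asarray(n_events, dtype=float)
        t = np.asarray(block_length, dtype=float)
        safe_n = np.where(n > 0, n, 1.0)
        return np.where(n > 0, n * (np.log(safe_n) - np.log(t)), 0.0)

    # additivity: the fitness of a partition is the sum over its blocks
    print(poisson_block_fitness([5, 0, 12], [2.0, 1.0, 3.0]).sum())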
>>: The results would change, right, if you change that?
>> Jeff Scargle: Yeah, a little bit. I don't think they change much if you
have a reasonable quantity measuring constancy over the interval.
>> Ani Thakar: So this is a work in progress. Since Alex already introduced the SDSS SkyServer yesterday, I don't need to spend any time on that, and Curtis also spoke about it this morning.
Okay. So this is meant to be a pun and George didn't like it so he put me in
the wrong session just before coffee. Anyway, so this is ten years of
collected log data, both web hits and SQL query data for the SkyServer. This
is an effort we're just beginning, like I said. Actually, the SQL part of it
that I'll talk about here, we're just starting that. But Jordan Raddick has
been working on basically doing the ten-year sequel to the first traffic
report, which I'll describe in a minute. And he's been doing most of the heavy
lifting here. And, of course, Jim is the person who got us started on this
whole adventure, and he insisted that we keep everything, log everything from
day one. So this is really still his show. So the first traffic report, which
a student working with Jim put together, covered the first five years of the
SkyServer, 2001 to 2006. And Jordan's made a short URL for this in case
anybody wants to look it up.
So, I mean, this cannot do justice to it. It's really got a lot of rich data
in this first traffic report, even for the first five years. And if you're
interested, you should take a look at it. But just very partial highlights.
The web and SQL traffic during this period was doubling every year. That
trend, of course, has not kept up. Things have flattened out considerably.
Hundreds of astronomers during this time basically graduated from using either
the canned form queries or the sample queries on to using more free-form SQL.
We also noted a flurry of activity after each release. So this is basically
showing DR1 through DR4 and how there were spikes in the activity. One of the
difficulties from this first report was how to separate the traffic from
programs or what we call bots and actually human users and this is kind of the
best attempt to do that.
And so these are the total web hits and SQL queries and what we call web
sessions. Web sessions are defined so that where the interval between
subsequent page hits was more than half an hour, that's where the session
ended. So within that, we call that a single session.
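[A minimal sketch of that sessionization rule, assuming page hits given as (user, timestamp) pairs; the half-hour cutoff is the one stated above, everything else is illustrative.]

    from collections import defaultdict
    from datetime import datetime, timedelta

    SESSION_GAP = timedelta(minutes=30)   # a gap longer than this ends the session

    def count_sessions(hits):
        # hits: iterable of (user_id, datetime) page hits, in any order.
        # A session is a run of hits with no gap longer than SESSION_GAP.
        by_user = defaultdict(list)
        for user, ts in hits:
            by_user[user].append(ts)
        sessions = {}
        for user, times in by_user.items():
            times.sort()
            n = 1
            for prev, cur in zip(times, times[1:]):
                if cur - prev > SESSION_GAP:
                    n += 1
            sessions[user] = n
        return sessions

    # Two hits 10 minutes apart, then one two hours later -> 2 sessions.
    hits = [("1.2.3.4", datetime(2011, 5, 1, 9, 0)),
            ("1.2.3.4", datetime(2011, 5, 1, 9, 10)),
            ("1.2.3.4", datetime(2011, 5, 1, 11, 30))]
    print(count_sessions(hits))   # {'1.2.3.4': 2}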
So for mortals, most of the activity is here in the web hits. Then SQL
queries, percentage-wise, it's a very large percentage -- actually, I think
this figure might be wrong. These three should basically add up to this one.
So anyway, spiders are programs that crawl the website in order to create an
index, to be differentiated from bots, which are actually programs that hit
the SQL database in a very heavy way. And actually, if you want to look at the
percentages, for the web hits the mortals are the main contributor and then
bots are the second largest, and then there are these spiders that are
basically trying to create an index -- they essentially go to the robots.txt
file.
And then for the SQL queries, you can see that the bots really are the lion's
share of the SQL queries and then the mortals are here and then very few from
the spiders. And in terms of the session, it's kind of equally divided between
the spiders and the mortals, and then very little from the bots.
>>: So sessions, you mean they go there once and do something and they go
away, or is it how long they've spent when they go?
>> Ani Thakar: It's basically when they have successive page views and the
page views are within half an hour of each other. So half an hour between
page views is defined as the maximum.
>>: So the bots [indiscernible].
>> Ani Thakar: So the bots are basically programs, for example people trying
to download the whole dataset when there's a new release, some of these data
centers, or even people who have scripts wanting to do something similar.
So Alex showed this yesterday, and this is basically the ten-year traffic kind
of at a glance. So this is, these are the web hits and this is the SQL, the
dark SQL traffic. One thing he didn't mention is that these two big things,
spikes here in the web hits was the galaxy zoo release in July 2007, which as
he mentioned we had to add servers very quickly in order to handle. This was
like 45 million hits in a month. But most of that was over a period of a week.
This is actually unreal, this 39 million queries in October 2008. Actually,
this occurred over just -- most of that was one day and I'll get to that in a
minute. This is the sequel to the five-year report that we're preparing, and we
plan to extend the original analysis to the larger dataset. But in terms of
the web hits and SQL queries, which were both in that first report, we want to
separate those into different papers now.
So we can do more justice to the SQL query analysis, and that's basically what
I'm going to be talking about here. So this is quite a unique dataset. I'm
not aware of any other such large time interval dataset for SQL query usage,
and I think it's a very good way to try to determine how data intensive science
is done.
So the questions that we really want to ask, starting with who's actually
using this -- who's using the SkyServer and CasJobs server SQL tools that
we've made available, how often are they using them and how are they using
them? Are they getting better at it with time? How complex are the queries,
are the complexities increasing, and how are users using SQL? A very important
aspect of that is what type of science is being done. This is going to be,
you know, a more difficult thing to determine.
And is the system meeting the requirements, how can we improve it and how
effective is online help that we've provided.
So just a few numbers, sort of aggregate numbers. The total number of SQL
queries to date is 194 million. Out of those, 68 million are unique queries;
that's doing a select distinct on the actual query text. Of the total number
of queries, 145 million actually succeeded, meaning the error return was zero,
and the rest failed.
So in terms of users, the top five SQL users, these are all bots or programs,
no surprise there, because they're in the millions, no human can do that many
queries, even over ten years. So the top prize goes to the University of
Victoria, the CADC. And, in fact, most of this was on a single day, October
23rd, 2008.
This was actually just before the DR7 release, so the data was online but
wasn't publicly announced. And most of these queries, also no surprise, did
not succeed because they were just too fast. So about 360K did succeed over
that one day.
>>: When you say [indiscernible].
>> Ani Thakar: No, I mean, they were just shooting the queries too fast. I
mean, there wasn't time to get anything back. So I think there was not enough
interval between successive queries.
>>: Do you have a policy on just, like, how much you can [indiscernible].
>> Ani Thakar: Yes, actually, the SkyServer has a throttle of 60 queries per
minute, and interestingly enough, we ran into a problem where a teacher was
using the SkyServer in one of their class exercises and all the students were
hitting the buttons at the same time. So we had to make an exception for them.
But actually, it was Jim who put in that throttle. I remember once he sent a
stern email to a person who was submitting a lot of queries.
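[A minimal sketch of a per-client throttle like the 60-queries-per-minute one mentioned above, using a sliding window; this is illustrative and not the SkyServer's actual implementation.]

    import time
    from collections import defaultdict, deque

    class QueryThrottle:
        # Sliding-window rate limiter: at most `limit` queries per `window` seconds
        # for each client.
        def __init__(self, limit=60, window=60.0):
            self.limit = limit
            self.window = window
            self.history = defaultdict(deque)   # client -> timestamps of recent queries

        def allow(self, client, now=None):
            now = time.monotonic() if now is None else now
            q = self.history[client]
            while q and now - q[0] > self.window:   # drop timestamps outside the window
                q.popleft()
            if len(q) >= self.limit:
                return False                        # over the limit: reject the query
            q.append(now)
            return True

    throttle = QueryThrottle(limit=60, window=60.0)
    print(throttle.allow("1.2.3.4"))   # True until the 61st query within one minute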
So, starting to look at the distribution of users -- and I think this is along
the lines of what you were asking, Kirk, but I think not in enough detail for
what you want. These are the organizations that we've detected from the IP
information that we have. And so this is the kind of catch-all other category.
But these are the universities, the non-university colleges. This is K through
12. This is other information service providers. This is the national
government institutes and then other government, regional. That's basically,
you know, dividing up the kinds of institutions that the queries are coming
from.
In terms of the web hits, this is the kind of distribution. Again, the color
coding is the same. Other is the main category again, but quite a few queries
come from university sources and government; quite a lot of this is from
Berkeley Lab, et cetera. And then for SQL queries, here, you know, university
wins by quite a large factor and then there is the unclassified category. But
everything else falls into a fairly small portion of that.
So the thing that we're trying to do is figure out what kind of complexity we
are seeing in the SQL queries. A very naive way to do this would be to look at
the length of the query, but this doesn't tell us too much. Sometimes it does,
but there are quite a few types of long queries that you could write that
aren't very sophisticated.
Then the next thing we do is look at the numbers and types of joins and look
at whether people are using group by and order by constructs, and then more
advanced things like cross joins and cross applies, which are more recent
additions to the language, and then the old cursor, which was doing some of
what these do.
So those are more sophisticated kinds of queries. And then, of course, how are
people using the user-defined functions? It also depends on the type of
function: some functions are quite simple, but then there are others that do
quite a lot more. And this is also a good way to detect what kind of science
people are doing. And then, of course, combinations of these things.
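[A minimal sketch of a crude complexity index of this kind, scoring a query by which constructs appear in it; the particular weights and patterns are illustrative assumptions, not the index actually used.]

    import re

    # Illustrative weights: more advanced constructs score higher.
    CONSTRUCT_WEIGHTS = {
        r"\bjoin\b":          1,
        r"\bgroup\s+by\b":    1,
        r"\border\s+by\b":    1,
        r"\bcross\s+join\b":  2,
        r"\bcross\s+apply\b": 2,
        r"\bcursor\b":        2,
        r"\bdbo\.\w+\s*\(":   2,   # calls to user-defined functions in the dbo schema
    }

    def complexity_index(sql):
        # Crude complexity score: sum the weights of the constructs present.
        text = sql.lower()
        return sum(w for pattern, w in CONSTRUCT_WEIGHTS.items()
                   if re.search(pattern, text))

    q = """SELECT p.objID, s.z FROM PhotoObj p
           JOIN SpecObj s ON s.bestObjID = p.objID
           WHERE p.r < 17.5 ORDER BY s.z"""
    print(complexity_index(q))   # join + order by -> 2 with these toy weights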
So we defined SQL templates. So this is basically, we divide queries into
templates. First of all, these focus on the successful queries only, so these
are the unique successful queries of 69 million. And then do a replace of all
the numbers in the query in the SQL with a hash sign and then do a select
distinct from the resulting SQL statements, and then assign each template that
we get from this a template ID. This results in just under a million query
templates, and from this we start to derive this crude complexity index.
Again, like I said in the previous slide, based on the presence of certain SQL
elements. Besides classifying queries into classes or, you know, templates,
these are really useful because there are only a million templates, and
querying them is a lot faster than querying the entire database, especially
when you're doing text searches.
So this is how you create a template. This is SQL code we created; we wrote
this regex because SQL doesn't have a regular expression replacement or search
facility, so this is something we wrote in C#. And then we, you know, squeeze
out all the numbers, replace them with hashes, and then squeeze out white
space, comments, et cetera, and group by whatever is the result of this
replacement.
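[A minimal Python sketch of that templatization step (the original was a C# regex helper called from SQL, which is not shown here); numbers become '#', comments are stripped, and whitespace is collapsed, as described above.]

    import re

    def query_template(sql):
        # Reduce a SQL query to its template so that queries differing only in
        # their constants map to the same string.
        text = sql.lower()
        text = re.sub(r"--[^\n]*", " ", text)                        # strip line comments
        text = re.sub(r"/\*.*?\*/", " ", text, flags=re.S)           # strip block comments
        text = re.sub(r"\b\d+(\.\d+)?([eE][+-]?\d+)?\b", "#", text)  # numbers -> '#'
        text = re.sub(r"\s+", " ", text).strip()                     # squeeze whitespace
        return text

    q1 = "SELECT TOP 10 objID FROM PhotoObj WHERE r < 17.5  -- bright objects"
    q2 = "SELECT TOP 500 objID FROM PhotoObj WHERE r < 19.2"
    print(query_template(q1) == query_template(q2))   # True: same template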
>> Ani Thakar: So how am I doing for time?
>>: Okay. Couple minutes.
>> Ani Thakar: Okay. So as far as the SQL constructs go, this is where all the
joins, outer joins, different types of joins, group by, order by. These are
the templates, and then the lighter bars are the aggregates in the entire
dataset, and this is basically the queries we used to get these numbers.
So then the line is where I basically divide the number of queries by the
number of templates. That gives you some idea of how popular, how frequently a
given template is used, and this is the kind of distribution we see. Haven't really
done much in terms of analyzing this, but that's the kind of data we have right
now.
The length of the queries, again, is not going to tell us too much, but the
vast majority of queries are under 100 bytes. This is 200, 300, 400, up to a
thousand bytes, so anything near 1K is a small fraction of that. And most of
the bot and program queries are very small. And, of course, there is actually
a limit on the query length in the SkyServer, so that kind of restricts this
as well.
A couple of other studies have been done along similar lines. There is a
thesis at Drexel which created this Java log viewer, and there was a kind of
feature called sky map, which shows the spatial coverage of queries, and then
there was interactive exploration of the SQL logs with color-coded SQL
elements and also a statistics viewer. So you could actually view the color
coding for different SQL constructs here, which was very useful.
Unfortunately, by the time this was finished, it wasn't hooked up to the live
SkyServer log database. It was still working on a downloaded snapshot so that
really restricted how useful this was.
And then Nolan Li, who developed the CasJobs, MyDB service, he also did some
analysis of CasJobs queries and studied how, you know, users do data-driven
analysis. And two things he studied were the number of MyDB objects per query,
so these are basically MyDB tables or any other kind of objects per user, per
query. And then the number of linked MyDB objects created from queries, which
is a better measure.
And only 38% of users had only one dependency, but these users were responsible
for 76% of the work flows, so they were quite active.
So the next steps: try to get templates for the sample queries to see how much
the sample queries are being used; refine the complexity index and then track
complexity as a function of time; try to track SQL sessions. This is going to
be difficult, and it is, of course, much more relevant for CasJobs users
because there are facilities in MyDB to actually do this kind of extended
analysis.
See how people are using the built-in indices and the hierarchical triangular
mesh, HTM, spatial index, and also get more detailed user demographics,
meaning things like what kind of users they are. This is something that Kirk
was asking about: whether users are scientists or members of the public, or,
you know, professional or amateur astronomers, et cetera. That's it. Thank
you. This is the up-to-the-hour traffic site for the SkyServer, if anybody
wants to take a look.
>>: So we do have time for one question, which I might start with. Did you
ever consider way back when putting in place a registration system and what
effect do you think that would have had? Obviously, it probably would have
reduced the number of people using it, but maybe it would have also reduced
the number of spurious queries and errors.
>> Ani Thakar: We have log-in for CasJobs, so you have to have an account as a
CasJobs user, and then you get your own database and you can run queries
against that database. But for the SkyServer, no, it's just a browser-based
thing and it's public. So we don't really require any kind of registration
for that.
>>: Do you think it may have reduced spurious traffic or --
>> Ani Thakar: I don't think that's really a big problem. When you say
spurious traffic, what exactly do you mean?
>>: Well, there were a lot of, I guess, just hammering the system.
>> Ani Thakar: That's actually legitimate users. Just that people are trying
to download the data. That may not be the best way to do it, but that's not
necessarily illegal. We want people to do that, just maybe not --
>>: Slightly slower.
>> Ani Thakar: Yeah.
>>: So [indiscernible] is SQL still the way to go if you're designing
[indiscernible], is this framework still scalable for the larger stuff?
>> Ani Thakar: Well, so here we're kind of cheating, right? It's not just SQL,
it's a unique set of functions and procedures we're using, plus we've built in
this spatial index. So all these kinds of add-ons you need to make SQL work
for this kind of data. And that's like one or two orders of magnitude more, at
least. So I don't know whether databases will scale to that, or whether you'll
maybe need SciDB or something like that. I don't know.
>>: All right. We have a coffee break now. Can we be back in 20 minutes,
after an announcement.