>> Raffaele D'Abrusco: Thank you. ... of what I mean by workflow, okay? Let's start with the very name, the definition. Sometimes it's useful to define boundaries. So I'll just show you how we can use different kinds of methods borrowed from statistical analysis, advanced statistical analysis and machine learning, to address a single problem. These kinds of tools can be chained together in order to cover and answer different questions about, in this case, how we can select blazars and how we can characterize their spectral energy distributions.

If you have a good workflow, from my point of view, what you're going to be able to do is find new correlations, interesting things that nobody noticed before, or generalize already known correlations to a higher dimensionality -- just think about adding new colors or new observables to some relation that we already know -- or find simple correlations that for some reason have been overlooked so far. And that still happens. The thing I'm going to discuss in a couple of slides is an example of a simple correlation that was not known before the application of KD methods. And, of course, you want to push your work further and use the discovery or correlation that you've found by applying other KD methods to classify sources, to extract new knowledge and to reuse or recover data that are sitting out there. That's something that's really close to my heart, because I have been involved in the Virtual Observatory, so I really aim to be able to reuse data that are sitting there in the archives and are available to astronomers.

As Pepe Longo mentioned in a previous talk, I was his student for a while during my Ph.D., and we focused on two simple problems -- simple in the sense that they can be easily defined in terms of a KD application task -- which are the classification of candidate quasars, that is, the extraction of optical quasars based only on photometric data, and the determination of photometric redshifts, as Pepe already discussed. I have listed on this slide a couple of works that my collaborators and I have been working on over the last three years. The basic idea, at a very general level, is that we applied unsupervised clustering algorithms to determine the best distribution of sources into different groups in a usually high-dimensional feature space, in order to optimize the training of regressors, in the case of photometric redshifts, or of classifiers, where extracting quasars and distinguishing quasars from stars is concerned.

So these are two very simple tasks, but they can be tackled using really different combinations of KD methods. For example, in this example, which is a description of the Weak Gated Experts method that [indiscernible] and I worked on and published last year, we have two layers of KD. Three layers, actually. The first one is unsupervised clustering in the color space of quasars or galaxies. The second layer is what we call the first layer of experts: we have regressors, supervised trained experts, that learn to recognize the patterns in the data linking the colors of the sources to the redshift. And then we have the third and last layer, where a gating [indiscernible], which is another neural network, learns how to combine the different outputs of the second-layer experts in the best way.
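Purely as an illustration of the three-layer idea just described, here is a minimal Python sketch: unsupervised clustering of a synthetic color space, one supervised regressor per cluster, and a simple gating model that learns to combine the experts. The model choices (KMeans, MLPRegressor, a linear gate) and the fake data are stand-ins, not the published WGE implementation.

```python
# Minimal sketch of the clustering + experts + gating layering (illustrative
# only; synthetic "colors -> redshift" data, generic scikit-learn models).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
colors = rng.normal(size=(2000, 4))              # stand-in for photometric colors
redshift = np.abs(colors @ np.array([0.3, 0.1, -0.2, 0.05])) \
           + 0.05 * rng.normal(size=2000)

# Layer 1: unsupervised clustering of the color space.
clusterer = KMeans(n_clusters=5, n_init=10, random_state=0).fit(colors)

# Layer 2: one supervised "expert" regressor per cluster.
experts = {}
for k in range(clusterer.n_clusters):
    members = clusterer.labels_ == k
    experts[k] = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                              random_state=0).fit(colors[members], redshift[members])

# Layer 3: a gating model that learns how to weight the experts' outputs.
stacked = np.column_stack([experts[k].predict(colors) for k in experts])
gate = LinearRegression().fit(stacked, redshift)   # stand-in for the gating network

def predict_redshift(new_colors):
    """Combine every expert's prediction through the gating model."""
    preds = np.column_stack([experts[k].predict(new_colors) for k in experts])
    return gate.predict(preds)

print(predict_redshift(colors[:3]), redshift[:3])
```

Any of the per-cluster experts could be swapped for another regressor, or even a template-fitting method wrapped in the same interface, which is the extensibility point made in the talk.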
Let me tell you the two most interesting things about this method. The first is that it does not require any fine tuning to work on different kinds of sources. We applied it to galaxies and to quasars, and we all know that the spectral energy distributions and the correlations between colors and redshift are quite different for the two, because the emission mechanisms at play are different. The second thing is that we used neural networks because they were simple and we had a very narrow set of parameters, so we could play with a small set of parameters for each given expert; but every single expert could be any kind of regressor or any kind of supervised tool. We could even add template-fitting methods as single experts and combine them. So in general the approach is extensible and can potentially be used to address other kinds of problems involving simple classification or regression.

So the question is, can we extend this kind of approach to a more general question? Can we use unsupervised clustering to find whether there is any kind of pattern or correlation between the set of observables that we use to cluster -- that we use to build the feature space where we apply our clustering algorithms -- and some other observables that we are not using, to understand whether some interesting signal is present in our data? This kind of question, which relates the distribution of sources inside each cluster to the distribution of some outside observables, is what we tried to answer, generalizing the approach that I showed you on the slide before and coming up with the CLaSPS method.

Basically, there are three simple steps. When we have a lot of observables, usually a very large number of observables, we decide which ones will be used to build the space where we actually perform the clustering. Let's say these are the colors, because most of this work was inspired by the main use case, which was trying to link the classification -- usually obtained from spectroscopic data, or by looking at the absence or presence of emission in X-rays, radio or gamma rays -- to the actual shape of the SEDs of the sources, without fitting the SEDs. That was the point: just looking at how the points are distributed in some parameter space. So we have this first layer where we perform the clustering, and then we apply some measure of how the members of each cluster are distributed relative to some other observables that we have not used for the clustering, using a little mathematical trick, and I will show you the kind of diagnostics we defined to do that. Then we try to understand whether the patterns that we find are due to some instrumental effect or just noise in our data, or whether they actually are [indiscernible] that can be used for astronomy.

The core of this method is this number here, which we call the score, for lack of a better name. It's basically a measure of the number of members of a cluster that belong to each bin into which we divide the distribution of the labels. I'm calling features the observables that we use for the clustering, and labels the things that we attach to the single members of the clusters to understand how these clusters are formed. Let me do a very simple example.
If we have stars and quasars, just as I told you before, the labels will simply be binary, zero or one, stars and quasars; I'm saying one for quasars and zero for stars because we're interested in quasars, but it would work the other way around, of course. The key point is measuring this f_ij, which is the fraction of sources in the i-th cluster that belong to the j-th bin of the label distribution -- in this case, zero or one. Then we weight this by the number of clusters. We evaluate this number for each cluster of a given clustering and for different clusterings, where the differences between clusterings can be the total number of clusters and the method that we used to obtain them. These are very simple two-dimensional representations of the high-dimensional structure of the clusters in terms of the scores. Then we look for high values of this number, which tell us that something is going on, and we try to determine whether the pattern is real or not through simulations, okay?

Our first interesting application of this method was, as I told you before, trying to determine whether we could find new ways to characterize the high-dimensional distribution of sources in color space, especially AGNs, using as labels the spectroscopic classification, the absence or presence of X-ray emission, [indiscernible] quietness and other things like that. So we used a large AGN catalog where we added labels that told us whether a source was a blazar or a [indiscernible] type 1 or 2, whether it had radio emission or not, and so on. And we used as features data from a very narrow range of the electromagnetic spectrum. As you can see, we used feature points located in this region, basically going from the mid-infrared [indiscernible] WISE to the [indiscernible] of the galaxy. Of course, this bunch of points is not representative of the features that we actually used; it is just to show you the spectral range where we worked. And we used as labels information coming from regions of the SEDs that are far away from the region where our features are located.

The interesting thing that we found, apart from patterns that did not lead us anywhere interesting, is that a particular family of sources, the blazars, had a very strict correlation with two particular labels, which were the gamma-ray emission -- the fact that they could be seen in the gamma-ray sky, thanks to Fermi data -- and the spectroscopic subclassification available for blazars, into flat spectrum radio quasars and BL Lacs. I can just say that BL Lacs are the real blazars, okay, the prototypical ones, while flat spectrum radio quasars are sources that show most of the blazar traits but are in some way contaminated by other components in terms of the SEDs of the sources. And we nailed this correlation down to basically a few, a very small number of clusters that were constant through all the clusterings that we could perform, using different methods. Let me also tell you that the number of sources in this case was quite small and the dimensionality of the space was 12 or 11, so not that large. But the interesting thing is that we could recognize there was something constant along all the clusterings that we did. So how could we understand what was going on? We basically did what any astronomer would have done in that case.
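As a rough illustration of the score idea sketched above -- cluster on one set of observables, then check how unevenly a label that was not used for the clustering is distributed across the clusters -- here is a small Python example. The particular score formula below is a placeholder, not the published CLaSPS definition, and the data are synthetic.

```python
# Rough sketch of the CLaSPS-style diagnostic: cluster on the "features",
# then ask whether an unused "label" is distributed unevenly across clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 11))      # e.g. 11 colors used for the clustering
labels = rng.integers(0, 2, size=1000)      # e.g. 0 = star, 1 = quasar (binary label)

clustering = KMeans(n_clusters=8, n_init=10, random_state=1).fit(features)

def cluster_label_fractions(cluster_ids, labels):
    """f[i, j] = fraction of members of cluster i falling in label bin j."""
    clusters, bins = np.unique(cluster_ids), np.unique(labels)
    f = np.zeros((clusters.size, bins.size))
    for i, c in enumerate(clusters):
        members = labels[cluster_ids == c]
        for j, b in enumerate(bins):
            f[i, j] = np.mean(members == b)
    return f

f = cluster_label_fractions(clustering.labels_, labels)
# One simple per-cluster score: deviation from a uniform mix of label bins.
score = np.abs(f - 1.0 / f.shape[1]).max(axis=1)
print(score)    # high values flag clusters worth checking against simulations
```

With purely random labels, as here, all the scores hover near zero; a real signal of the kind described in the talk shows up as a handful of clusters with persistently high values across different clusterings.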
After finding that there was something strange, a large value of the score, we basically ran PCA, principal component analysis, and we determined that the correlation between these two labels and the distribution of blazars in our high-dimensional parameter space could be explained in terms of the distribution of the data in the mid-infrared WISE color space. So we went from an 11-dimensional space to a three-dimensional space, which is something that anybody else could have done in principle, but for some reason it had not been done before. So we applied some automatic way of tracking down correlations in the larger or smaller projections of the parameter space, and we determined that basically this region of the SEDs of these sources was responsible for the correlation.

And the interesting thing is that if you plot, in a very simple two-dimensional projection of the color space generated by the WISE filters, c23 versus c12, the distribution of blazars in our sample, you can see that they occupy a very peculiar, narrow region of the parameter space. Let me tell you that the density of other sources here is basically zero. So all the sources that we can find in a given region of the sky, looking at the WISE photometric catalog, that are located here, are blazars. The contamination is between zero and two, three percent. And this region is occupied by the BL Lacs, the prototypical blazars. In this region we have a larger contamination, of course, because this is where most of the quasars are usually found in the WISE color space, but we're talking about 20, 25 percent contamination. That's a very good number if you want to produce a list of candidate blazars that you can use for follow-up.

So the first thing that we did was to model this locus, what we call the three-dimensional locus of points, in order to use it to extract new candidate blazars from the WISE catalog, or to check whether gamma-ray emission that was not associated with any known source could be explained by the presence of candidate blazars. And we did what [indiscernible] would have done: we produced a very simple model, just imposing that our boundaries should contain 95 percent of the sources. And this worked quite well, because we were able to associate candidate blazars with 156 of the unidentified gamma-ray sources from the [indiscernible] catalog. Of course, we are not saying that these 156 sources actually are blazars, because we need follow-up spectroscopic observations to confirm it 100 percent. But some people had speculated, based only on the surface density of blazars, that a large fraction of these unassociated sources could be associated with actual blazars. That's a good thing, and we are planning -- basically, we asked for [indiscernible] to confirm this.

And then it came to my mind that we were doing it the wrong way, because we were basically using a very simple model; we needed to think in terms of data mining. So what we did is treat the modeling of the locus as a supervised classifier that has to be described by a very small number of parameters, parameters that can be fine-tuned in terms of the quantities we are interested in -- the ability to reconstruct the shape of the SED, the efficiency of the selection process, the completeness of the selection process -- and the other constraints that we decided to impose in order for it to be extendable.
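Mechanically, the simple locus selection described above amounts to filtering a catalog on a couple of WISE colors. The sketch below shows the filtering step only; the numerical boundaries are placeholders, not the modeled 95-percent locus from the published work.

```python
# Sketch of selecting candidates in a two-color WISE-like space
# (c12 = W1 - W2, c23 = W2 - W3). The cut values are placeholders,
# NOT the published boundaries of the blazar locus.
import numpy as np

def wise_colors(w1, w2, w3):
    return w1 - w2, w2 - w3

def in_candidate_region(c12, c23,
                        c12_range=(0.6, 1.2),    # placeholder limits
                        c23_range=(2.0, 3.2)):   # placeholder limits
    """Simple rectangular cut standing in for the modeled 2D/3D locus."""
    return ((c12 >= c12_range[0]) & (c12 <= c12_range[1]) &
            (c23 >= c23_range[0]) & (c23 <= c23_range[1]))

# Toy catalog: columns of W1, W2, W3 magnitudes.
rng = np.random.default_rng(2)
w1 = rng.normal(14.0, 1.0, 5000)
w2 = rng.normal(13.4, 1.0, 5000)
w3 = rng.normal(11.5, 1.0, 5000)
c12, c23 = wise_colors(w1, w2, w3)
candidates = in_candidate_region(c12, c23)
print(f"{candidates.sum()} candidate sources out of {candidates.size}")
```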
Catalogs of blazars change with time. You basically go from the WISE preliminary release to [indiscernible], and something changes, and we want to be able to repeat [indiscernible] very quickly, in an automated way, over and over. And basically what we found is that we can work very effectively in the principal component space, and we constructed a quantitative measure which we call, not surprisingly, the score again -- that is just lack of fantasy on my side -- which tells us the probability that the colors of a source are compatible with the locus. Of course, the larger the score, the larger the probability that the colors of an unidentified WISE source are those of a blazar. Again, this does not mean that they are [indiscernible], but it is a huge step forward in determining which sources should be observed to [indiscernible].

And, of course, this kind of modeling also helps us to perform spatially unconstrained searches. So far, we have looked into regions of the sky where gamma-ray emission has been observed and where something tells us there might be blazars. But what about just filtering the whole WISE catalog, looking for sources that satisfy our constraints on the WISE color space, and then seeing whether these can be associated with something that smells like a blazar in other data? And that's exactly what we did. I'm just going to focus on this second application, which is much more interesting in terms of reusing archival data. We know that blazars are variable all over the electromagnetic spectrum. So we went through the [indiscernible], which are these little notes that astronomers who observe a transient in the sky write about what they see, and we looked for sources in the regions of the sky where these transients were observed that could be associated with our candidates. We found that out of the 500 and something, we could reassociate 50 of them with candidate blazars, and for 12 of them, which is a very large number, we found spectroscopic data somewhere else that helped us to confirm their nature. Other people had observed samples of [indiscernible] and they basically found sources whose spectra did not fit into the usual scheme of Seyferts, and they left it there. So we went through the archives and we found that these spectra are consistent with our picture, which tells us that these sources are blazars.

So why this kind of experiment? Because we could apply it to the huge wealth of archives that are out there. I'm thinking about optical variability. I'm thinking about the transients that have been observed in the same gamma-ray sky but have not been associated with any source in the catalogs just because they last a few hours, a few days, and didn't make the cut in terms of [indiscernible] ratio. So could we enlarge the number of known blazars? I'm positive that we can, and there is a never-ending number of applications. But the interesting thing is that we can also take on a challenge: moving to more complex datasets. A colleague of mine with whom I'm collaborating has been awarded a very large amount of time on Chandra to observe a larger region containing the COSMOS field, the Chandra COSMOS region, so we will have a very large and complex dataset to apply this kind of method to. We are talking about [indiscernible] information, classification coming from the [indiscernible] spectrum, and this is quite an almost perfect dataset to test the [indiscernible] of this method.
Of course, we need to improve the method, because now we are dealing with a real dataset, in the sense that the previous dataset was hand-picked. We chose only the sources for which we had all the measurements and all [indiscernible], for example. I'm not going to lie to you: our dataset was a test dataset that led to an interesting discovery, but we hand-picked our sources. In this case, we're going to handle something that has null values, missing measurements, upper limits and the kinds of things that make real data so interesting and so difficult to handle from the point of view of data mining techniques. So we need to address these problems without getting rid of most of the information contained in the dataset. And we found, thanks to the advice of a very good referee of our paper, that statisticians and computer scientists out there have already developed methods to handle these kinds of situations. It's called consensus clustering, and basically it helps us to combine clusterings obtained from the same dataset with different views -- different sets of features, different sub-projections of the same general feature space; this is called feature-distributed clustering -- or it can help us to combine clusterings from slightly different small samples of sources in the same feature space. So from the methodology point of view, I'm now working on these kinds of things.

Last slide. Basically, I think this is, from my point of view, one interesting example of the kind of discovery -- simple, but I'm aiming at complex -- that chaining sequences of different kinds of approaches, all borrowed from KD, can bring to astronomy. And I'm particularly proud that we were able to reuse, and are going to reuse, archival data, because it really justifies all the effort that people out there have put in during their whole careers, trying to create the protocols, infrastructure and services that allow astronomers to retrieve these data.

And, of course, the next big thing in KD applied to astronomy is variability, because so many variability-focused observations are coming up -- I'm thinking about optical, high energy and other kinds of things, of course. My opinion on this is that we really need to focus now on these kinds of methods, because in the very near future a certain kind of astronomy will not be possible without this kind of approach, and if we don't use it, we're going to lose a lot of interesting information and potential discoveries.

So, my last slide. I want to thank the agencies that paid my salary during the last two years, which are the center for -- oh, jeez -- the CfA, the Center for Astrophysics, and the VAO, which paid part of my salary, and I want to acknowledge the very useful help of my closest collaborators. If you want to read about the things I've discussed and understand them, you can find them here; these are the published papers about all the different threads I touched on in this talk. Thank you very much.

>>: Could the next speaker come up while we have a couple of questions?

>>: Let's just take variability. If it was always a real number, then you could just put in a real number -- the amplitude in magnitudes and [indiscernible] or something. But the fact is that the first question is, is it variable at all? So first you need a binary yes/no. And then, if it is, the amplitude can be written down. But sometimes you have better observations, more extensive, more sensitive than others.
So the absence of variability, it's not so simple. How do you deal with this heterogeneity, from binary numbers to real variables, and with the issue that the quality differs from one [indiscernible] to another?

>> Raffaele D'Abrusco: Okay. My answer is very simple: so far, we have not tackled this very difficult issue. So far, what we did is just go to the positions in the sky where our algorithm told us there could be a candidate blazar and look for any trace of people having observed variability. So we are basically using some kind of user-generated information. Of course, there will be a large number of sources that are actually blazars but have not been observed at the times when people were pointing their observatories. How to assess and incorporate quantitative information about variability is something that is very hard, because you have to take into account how you estimate whether a source is variable or not. I have some ideas -- we could use some of the estimators, the diagnostics, that have been developed in the community -- but you also have to handle the fact that blazars vary on different time scales in different bands. For example, one of the reasons this works is probably that the region of the SEDs that WISE has observed varies on very different time scales from the high-energy and radio regions. So we're actually using information that is an average of the colors of the sources, integrated over ten -- from eight to 40 -- exposures, and this helps us to get rid of some of the noise introduced by variability. I'm calling it noise just from the point of view of KD. So it's going to be much more difficult, and we're thinking about that and working on it, basically.

>>: We need to move on. Sorry. Let's thank Raffaele again. So our next speaker is Ciro Donalek, continuing the discussion about transients.

>> Ciro Donalek: Hi again. I don't have any other [indiscernible], so I will just stick with the automated classification of transient events and variable sources. These are the other people in the group, most of whom are here: George, Ashish, Matthew, Pepe Longo, and then Andrew Drake and Matt Yang are working on this as well. This is a short summary: I will start with a very short introduction to time domain astronomy for non-astronomers, and then proceed to the problem it poses for classification and how we are attacking it, like the [indiscernible] and the work done on feature selection. And that's a slide for Jim Gray, which reminds me, and you, that I'm a computer scientist and not an astronomer. And this data [indiscernible].

So briefly, why do we need data mining in science? These are some examples that people have talked about yesterday and today. We now have better and faster technology that is getting more and more data. Yesterday we saw examples of sequencing -- comparing billions of GCAT sequences in the genome -- or the terabytes of daily data that are about to come from synoptic sky surveys, or the climate data. And all this data is changing in nature; what I mean is that basically we refine our data, we get new data, it's heterogeneous. But all of these disciplines have exactly the same goal, which is to extract knowledge. That's why data mining is really important. And you want to do that as rapidly and as efficiently as possible.
And in all these disciplines, we end up using some of the same data mining tools, doing something related to classification, clustering, regression, path analysis and visualization. So that is what we have in common. But in my opinion, the main reason why we need much better data mining tools is to not end up like this -- that's probably what [indiscernible] storage machine in the next couple of years will look like.

So this is a short introduction to time domain astronomy for non-astronomers, because I see there are many in the audience. Basically, it's becoming one of the most exciting new research frontiers in astronomy. What we have now are telescopes that look at a certain part of the sky over and over, so basically each object, for the scientist, is like a time series. And these are some of the classification challenges this poses. First, real-time artifact removal: for example, in this case these are three false positives, and we need an automatic way to remove them from our data; we are [indiscernible] removing these artifacts using [indiscernible] neural networks. Then, of course, there is real-time classification, because sometimes you want to be able to classify within a few minutes of the discovery so you can trigger some robotic follow-up. Then there is next-day transient classification, which is mostly what we're talking about today. And, of course, decision making, because out of the hundreds of thousands of transients that will soon be coming out of [indiscernible], you want to choose the best ones, the ones [indiscernible], because data is cheap but follow-up data is still very expensive, because you need to [indiscernible]. So most systems today rely on delayed human judgment, and that will not scale with the next generation of surveys.

Probably the best way to illustrate what's going on is to just look at the pictures. On the bottom is the baseline sky, let's say the sky as we know it today. On the upper part are the new observations, and as you can see in the box, an object has changed: it's become brighter, much brighter, much bigger. Just looking at the images, at the points, you cannot tell much, because they are basically all the same. While with follow-up you can see that there are actually three very different kinds of phenomena: a flare star, a dwarf nova and a blazar. So we need much more information than just the images. And from a computer science perspective, it's sometimes [indiscernible] like dealing with the stock market, because there are many classes, many classes of shares, and people interested in different classes. Some people want supernovae, some are just interested in blazars. So you can basically have an on-demand classifier, a classifier optimized for a given class. Of course, we want high completeness to maximize the gain and low contamination to minimize the losses.

So in general, what we are dealing with is a massive multiparametric dataset, peta-scale ready. The data is very sparse and heterogeneous, as I'll show you in a moment, especially Catalina. We are dealing with a high number of features and classes. In the diagram there is a classification, done by -- I don't remember -- of the classes of transient and variable sources. And the number of features is now more than a hundred, 120. So in the classification we want to be real-time and reliable, with high completeness and low contamination, sometimes thinking about the minimum number of points.
Let's say it's just a new discovery: with just three points you want to understand something. Even if you only use them to remove some of the classes, that's okay; but still, you want to work with very few points. Then, of course, we have to learn from past experience, to refine our classifiers, and be as automated as possible. And then we want to include external knowledge, because sometimes, especially in astronomy, there is the telescope, the conditions under which the image was taken, and all that information.

So this is our dataset. Basically, we have both parameters and light curves. This is a distinction that Ashish actually made. We can call discovery parameters just the magnitude and time, so just when the actual object was discovered. Then there are all the contextual parameters that we can grab from archival information, like the distance to the nearest star and galaxy, the distance to the nearest radio source and so on. Then, if available, there are the follow-up colors, if there is follow-up. And then, because we have a light curve, there is the light curve characterization that I'll show in a moment. Of course, we also have the class information for some of them. And the data in this case is heterogeneous and unbalanced -- for certain classes we have thousands and thousands of objects, while for others we have like ten objects that we can actually use to train the classifiers -- and very sparse, with a lot of missing data, because, of course, we don't have follow-up for everything, some archival information may be missing, and so on.

I'll also be using the features extracted from each light curve with the Caltech time series characterization service. Basically, for each light curve we extract around 60, 70 parameters, periodic and non-periodic features. This is an example: a [indiscernible] with five outbursts, and this is a list of not quite all the parameters that we can extract from this light curve and on which we are training our models.

Since I'm talking about data coming from Catalina, just a slide on what Catalina is; the address is crts.caltech.edu. It's a survey that searches three quarters of the sky for highly varying astronomical sources, to find transients. All data is fully processed within minutes of observation, and, very importantly, especially for computer scientists and people working on this, all discoveries are made public instantly to enable follow-up.

This is a list of the problems that I was given: the overall classification, so trying to classify all the objects, or a very specific problem like a systematic search for CVs, or an RR Lyrae versus eclipsing binary dataset -- and that's interesting, because eclipsing binaries are the main contaminant when using RR Lyrae as tracers of Galactic structure. So these are specific problems for specific people with specific classes. And the way we are doing this is with binary classifiers. That's because different types of classifiers, as I will be showing, perform better for some classes than for others. And we can build a tree like that, starting with some [indiscernible]-motivated features: supernova versus not supernova, and then we can refine the supernovae, or [indiscernible] versus periodic, and so on.
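The tree of binary classifiers just mentioned can be sketched as a small cascade: one model separates the class of interest, a second one refines the rest. The features, class names and models below are synthetic placeholders, not the actual CRTS setup.

```python
# Sketch of a two-stage binary classification tree: "supernova vs. not",
# then "periodic vs. non-periodic" on the remainder. Synthetic data only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 20))                    # light-curve + contextual features
y = rng.choice(["SN", "periodic", "non-periodic"], size=3000)

# Stage 1: supernova vs. everything else.
stage1 = DecisionTreeClassifier(max_depth=6, random_state=4)
stage1.fit(X, (y == "SN").astype(int))

# Stage 2: trained only on the non-supernovae, periodic vs. non-periodic.
rest = y != "SN"
stage2 = DecisionTreeClassifier(max_depth=6, random_state=4)
stage2.fit(X[rest], (y[rest] == "periodic").astype(int))

def classify(x):
    """Run one feature vector down the two-stage tree."""
    x = x.reshape(1, -1)
    if stage1.predict(x)[0] == 1:
        return "SN candidate"
    return "periodic" if stage2.predict(x)[0] == 1 else "non-periodic"

print(classify(X[0]))
```

Each node of such a tree can use a different base model, which is the point made above about different classifiers working better for different classes.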
The classification schema we are using now: basically, we start with the input data that is [indiscernible] to each model -- light curves, features, archival data. Each of the models [indiscernible] different input data, because some can handle missing data, some cannot; some can work with light curves, some cannot. The models used are ensembles of KNN, [indiscernible], some decision trees, Bayesian networks, [indiscernible] neural networks and supervised SOMs. And from this framework what we get is one or more classifiers that can be escalated. The question is how to compute the combiner, and for now I'm just using a weighted model. Then we'd like to introduce external knowledge into [indiscernible]. And, of course, each model has its own base of knowledge, feature selection, et cetera.

The experiments framework is this one. We start with the base of knowledge for each model and do some pre-processing, including feature selection. Then we build the ensemble, [indiscernible] of classifiers or AdaBoost. To test the quality we use, let's say, stratified ten-fold cross validation, set the appropriate number of ensemble members, and then compute completeness and contamination on an independent test set [indiscernible]. And this is [indiscernible]; I still don't know how to do that.

So the first step has been to study which features work for which model. We start with over 100 features, so we need to address the curse of dimensionality and reduce the number of features -- also because some features may be misleading: in many cases, using all the features produces much worse results than using just a subset. And we do feature selection rather than PCA, because when the meaning of the features matters, it's often preferable to see which features are actually involved. Also, eventually we will be extracting these features from hundreds of millions of [indiscernible], so if we know which features are better, we can extract only those and [indiscernible]. And then, of course, we analyze these sets with the domain scientists [indiscernible].

On estimating feature importance, just a few more words. Sequential feature selection: basically, on a given model, [indiscernible] estimator, it's like [indiscernible] and Kullback-Leibler distances of the test; and then backward selection, [indiscernible] maps. For example, using the [indiscernible] dataset and the ranking algorithms, I got to the features and chose the [indiscernible], and the best three were period, median observation in [indiscernible]. So when I asked [indiscernible], does this make sense to you, he said yes, because this shows the relationship between the [indiscernible]. And [indiscernible] is important, [indiscernible] for doing something like a cross correlation: when the maps are equal, you can assume that the parameters are correlated.

So let's see some results. And this is like [indiscernible]. These are the results from the RR Lyrae versus eclipsing binaries dataset. This is a kind of [indiscernible] classification; it's the dataset most used as a benchmark in data mining. Out of the 60 features, I ran some tests, running the feature selection algorithms, and these are the results. Basically, all of the methods -- KNN, the neural networks and the decision trees -- are [indiscernible] on this dataset.
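A minimal sketch of the kind of benchmark comparison being described -- several base models evaluated with stratified ten-fold cross validation, scored by completeness and contamination -- on synthetic two-class data; the real experiments use the CRTS features and more models.

```python
# Sketch of the evaluation loop: stratified 10-fold CV of several base models,
# reporting completeness and contamination for the class of interest.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=1000) > 0).astype(int)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=10),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=5),
    "neural net": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=5),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

for name, model in models.items():
    tp = fp = pos = 0
    for train, test in cv.split(X, y):
        pred = model.fit(X[train], y[train]).predict(X[test])
        tp += np.sum((pred == 1) & (y[test] == 1))
        fp += np.sum((pred == 1) & (y[test] == 0))
        pos += np.sum(y[test] == 1)
    completeness = tp / pos                   # fraction of the true class recovered
    contamination = fp / max(tp + fp, 1)      # false fraction among the selected
    print(f"{name:14s} completeness={completeness:.2f} contamination={contamination:.2f}")
```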
Of course, the best ones are still the decision trees and the ensembles of decision trees and neural networks.

>>: How many data points go into this parametric database?

>> Ciro Donalek: It's very few.

>>: And how few?

>> Ciro Donalek: For this, 463.

>>: 463 photometric measurements on one object, or is that 463 [indiscernible]?

>>: It's 463 [indiscernible].

>>: How many photometric observations are in a typical --

>>: 250.

>>: 250 observations?

>> Ciro Donalek: Yeah, it was on the other slide.

>>: I'm sorry.

>>: I can tell you more about it later.

>>: Okay. And are they bright and high signal, or are they noisy?

>> Ciro Donalek: Yeah, they're noisy. You can see [indiscernible].

>>: Thank you.

>> Ciro Donalek: I can switch to the [indiscernible], that's the most difficult to do. Essentially, okay, this is entering the formula for the systematic search for [indiscernible], and basically what it is [indiscernible] very difficult to see. You can see [indiscernible]. These are the results using [indiscernible] trees and the self-organizing maps. Now, what is important to notice is that KNN, which was performing really, really well on the RR Lyrae dataset, here is basically classifying everything as a CV. That's why, in the overall classifier scheme, we should [indiscernible] output: in this case we know that this model is not good for blazars or CVs, so we should be able to include this external knowledge in the framework. I have also started trying to combine [indiscernible] -- so now I'm mixing [indiscernible], networks and the other inputs -- and it's slightly better than just using the [indiscernible].

Yesterday, George asked about the difference between dealing with strings and dealing with numbers. For some models you can encode this just by changing the distances. So this is from just yesterday: it's KNN using many different distances, and, for example, you can see that the [indiscernible] distance is, of course, what [indiscernible], because it's actually made for comparing strings to strings.

And then this is the future work. First, extreme data mining. There is a better way to combine the outputs from the single classifiers. Getting more data, for sure, refining the existing models and strategies, and investigating and adding more models. Then some more work on feature selection, but also the work on [indiscernible] classification using CLaSPS. And if there are any suggestions, I'd appreciate them.

>>: So while the next speaker comes up, do we have any questions? David?

>>: So you alluded to this, but you didn't say anything further about it. In most real classification tasks, especially if you imagine the future where LSST [indiscernible] follow-up tasks, we have utility considerations, decision-utility considerations. Sometimes follow-up is very expensive; sometimes there are tons of false positives, and so on. Many of these classification methods you're combining have no way to incorporate utility information. But some of them do, because some of them produce sort of quasi-probabilistic or actual probabilistic output, and then you could multiply by utilities and maximize your cash flow. And there's a kind of hidden [indiscernible]. I don't really have that much --

>> Ciro Donalek: No, I know what you mean. And naturally, I skipped all the measurement with [indiscernible]. Ashish will be talking about it. So we do want to mention that [indiscernible] is one of the problems.

>>: [indiscernible].

>>: It's a hard problem.
But really, we're facing situations where we're going to have to be [indiscernible].

>> Ciro Donalek: Yeah, good point.

>>: All right, a real quick one?

>>: So I wasn't clear. When you try to assign blazars or CVs, are you just using the time series data, or are you accessing WISE and other surveys?

>> Ciro Donalek: No, for these results I'm using just the features extracted from the light curve. But in the dataset we have all the features; again, in the dataset we use other things.

>>: Okay.

>>: Well, let's thank the speaker again. So next up, we have Jeff Scargle, talking about the Time Series Explorer.

>> Jeff Scargle: Okay. Well, thanks for the opportunity to speak here. Basically, this is a very simple presentation of some ideas for input to the classification schemes, or just in general things that can be done with time series data. An acknowledgment: Brad Jackson is the key person, the mathematician who essentially discovered the key to the Bayesian Blocks algorithm, and so on.

In a nutshell, here's what I'm going to talk about. There are many algorithms, standard old ones and some new ones, that can be applied to astronomical time series data in a general context of arbitrary sampling, uneven sampling and the other data problems that all real-world data have. These can be easily computed, in some sense rapidly, and displayed for large databases -- and that's a bit glib; when the size of the database is huge, that may not be so simple. This then leads to more input for automatic classification, which is one of the goals of this kind of thing.

I've tried to organize the different categories of algorithms -- maybe there are other ways of slicing this -- as: local structure representations in the time domain; more global approaches in the frequency or time-scale domain, with the standard power spectrum, correlation functions and the wavelet power spectrum; and, something that I think has been used quite a bit in astronomy but perhaps not enough, time-frequency or time-scale distributions, which are kind of a hybrid between the other two. I'll show an example of that, though not from the main dataset I'm using here, which is the same Catalina Sky Survey data that we have been talking about in this session quite a bit.
The raw data are time tag events for individual photons, and the only way to make sense graphically is to somehow bin it and that's what the yellow histogram is with arbitrary binning. Of course, when you pick an arbitrary bin, you never know ahead of time, are you making the bins too big or too small. Are you smearing out real features or are you allowing statistical fluctuations to show through. And in some sense, the Bayesian block algorithm and the representation shown as a blue histogram is addressing that issue. It's basically an adaptive histogram, in a way, of the time tags for the individual photons, where the data determines the locations of the bins and the widths of the bins. It's completely data-adaptive. And I just recently posted a long overdue paper with the details of this. There's an old paper that, unfortunately, has been sitting around for, I guess, over a decade now that is really -- I mean, I had the basic ideas there, but the algorithm in it is obsolete. This is the new algorithm based on dynamic programming and this is really the magic part of the algorithm that Brad 20 Jackson and the group at San Jose State came up with. I think this concept of dynamic programming should be more well known in this field. I'll show a little more detail about it later. But basically, it solves the problem of finding the optimal partition of this dataset into a finite set of blocks like this, and that's a huge exponentially large problem. But this algorithm allows an exact solution in N squared complexity or N squared time. And by the way, this paper is posted in the astrophysics source code library. There's a poster about that new repository. And in addition on the archive and hopefully in the final publication, all of the code necessary to reproduce the figures in the paper will be included. All the data sets that went into it and all the code. And this is implementing a concept that, again, I think astronomers should know more about and use. Namely, reproducible research. This is an excellent paper talking about the idea, basically showing that scientists have been really fairly cheesy in the past at publishing only the tip of the iceberg or an advertisement for results. You could almost never actually reproduce the details in most papers. So this is enforcing discipline of publishing everything so that anybody could reproduce the results in detail from scratch, so to speak. Here's -- I can't resist showing this, because this is applying the Bayesian blocks algorithm to my favorite object, the crab nebula, which for years was known as a standard candle assumed to be constant in flux at all wavelengths and, of course, now we know that in gamma rays, it's variable with flares in it and all kinds of interesting things. The red blocks show the Bayesian blocks analysis of this set of 1,500 data points, and that number of data points has ten to the 468 possible partitions. And this algorithm finds the optimum from that number explicitly. I'm sorry. >>: Is that a map reduced step? >> Jeff Scargle: No. That's a good point. This is sort of orthogonal to the whole idea of map reduce. You don't have to necessarily go down a tree. Dynamic programming is another way to cut down the tree, so to speak. 21 And it's so simple, it's almost embarrassing to show the mat lab code to do it. Of course, there are a few hidden things. But this is the class function, the fitness function that you use is in this function. 
But otherwise, the idea behind the algorithm is completely contained in that simple MATLAB code. And if I had a few more minutes, I could actually prove to you how the algorithm works and exactly what the dynamic programming steps do; this is the cartoon I would use for that, but in view of time I'll skip it. It's in the paper that I mentioned earlier.

People always ask, well, what are the errors in the block representation? And I always answer, well, what do you mean? I'm delivering a whole function; what do you mean by the errors? There are various ways of looking at it. This is sort of a bootstrap approach to that question, where the blocks are the actual block solution, the solid line is a kind of bootstrap average -- model averaging with the bootstrap -- and the spread is given by the thinner lines. Another thing you might be interested in is the errors in the locations of the change points, the points in time that mark the edges of the blocks, and here's a simple way of looking at the posterior for each change point, where you fix all but one change point and see how the posterior varies. So it's a little bit of a cheat, but not too bad.

Another thing that arises sometimes is that you have data on a circle, and we pondered that for quite a long time. If you think about it, the algorithm has to start at an origin of time: it works recursively, starting from one data point and adding successive data points. So if you don't have an origin, which is the case on a circle, in a way the algorithm is fundamentally thwarted. There's a way around that, and this is just a toy example of a histogram on a circle, angular data or something like that. I don't have time to go into the details of this, but there's a very nice theoretical paper by a Stanford group, Dave Donoho and Arias-Castro, with a theoretical analysis of the optimal detection problem in general, deriving a theoretical limit -- an asymptotic limit -- for the detection of signals in noisy data, and they derive an amplitude, A1 and A2 here, in two different cases, a theoretical amplitude. These are just some simulated results showing the error rate in simulated data as a function of the amplitude of the signal relative to this theoretical result. The fact that the error rate drops dramatically at about unity in this ratio shows that this algorithm is essentially approaching the theoretical detection limit.

Okay. Another thing you can do, of course, is look at power spectra and correlation functions. Here's a great paper that's fairly old but has been well used in astronomy, for calculating correlation functions from arbitrarily spaced data. It's a very simple idea, which this line of algebra encapsulates: you basically make some bins in the lag and calculate -- in this case this is a cross-correlation function of X and Y -- you just sum the products of the values of X and Y at points that are separated by that particular lag. And it's very useful, because you can start from this and calculate the power spectrum as the Fourier transform of this correlation function.
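A rough sketch of the binned-lag correlation idea just described, for two unevenly sampled series; the measurement-error weighting and normalization details of the published method are deliberately left out.

```python
# Binned-lag (discrete) cross-correlation for unevenly sampled series.
# Simplified: no error weighting, plain mean-subtracted, unit-variance series.
import numpy as np

def binned_cross_correlation(tx, x, ty, y, lag_bins):
    """Average product of standardized x_i and y_j over all pairs whose
    time separation ty_j - tx_i falls in each lag bin."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    dt = ty[None, :] - tx[:, None]        # all pairwise lags
    prod = xs[:, None] * ys[None, :]      # all pairwise products
    ccf = np.full(len(lag_bins) - 1, np.nan)
    for i in range(len(lag_bins) - 1):
        in_bin = (dt >= lag_bins[i]) & (dt < lag_bins[i + 1])
        if in_bin.any():
            ccf[i] = prod[in_bin].mean()
    return ccf

# Toy example: y is a delayed, noisy copy of x, both irregularly sampled.
def sig(t):
    return np.sin(2 * np.pi * t / 20.0)

rng = np.random.default_rng(7)
tx = np.sort(rng.uniform(0, 100, 300))
ty = np.sort(rng.uniform(0, 100, 300))
x = sig(tx) + 0.1 * rng.normal(size=tx.size)
y = sig(ty - 5.0) + 0.1 * rng.normal(size=ty.size)   # 5-unit delay
lags = np.arange(-30, 31, 2.0)
print(np.round(binned_cross_correlation(tx, x, ty, y, lags), 2))
```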
And so, sort of extending these ideas, you can generate a universal time series analysis machine, an automatic processing thing, for any data mode. I haven't gone into the details, of course, but all of what I've said before, and more, can be done for any of the standard data modes, with any sampling in time, to generate correlation functions, structure functions -- essentially, you name it. There may be some things of interest that aren't on this list. And if you have two time series, you can do the cross versions of all of these things. And here's a sort of matrix showing the same thing -- essentially, you can do all these things, and these are kind of the physical interpretations of what you learn. Are there any questions about this?

I'm going to show some specific results now. This is something that I hadn't thought of until I started working with George and the group down at Caltech; this is from one of their papers. It's -- I don't know what to call it -- a delta-M, delta-T plot. Basically, for every pair of observations you can calculate the magnitude change and the time interval between the two data points and just do a scatter plot. In this case, it's a density histogram representing that.

>>: We ended up calling these probabilistic structure functions. But there's got to be a better name.

>> Jeff Scargle: Maybe we should have a contest. If you couldn't hear: probabilistic structure functions, I think. Frankly, I can't think of a worse name for it, but maybe there is one, because I hate structure functions, is the main --

>>: You said it's a scatter plot. Why is there structure in it?

>> Jeff Scargle: Why is there structure in it? George, this is your plot.

>>: It's a seasonal [inaudible].

>> Jeff Scargle: You may have the same question for things I'm going to show now. Here is just sort of implementing the notion of a universal machine where you put in the time series; here's just one object. This is from several hundred Catalina Sky Survey AGNs and blazars, I think, is what I selected. This is a case where the raw data are shown as the points with error bars, and the Bayesian Blocks representation is shown as a red line. Actually, this is a nonvariable object according to the algorithm; I don't know why the red part doesn't go all the way. Then, automatically, this is just a histogram of the intervals, the delta-T intervals. You almost always learn something: this seems trivial, it's just the distribution of the intervals between the data points, but almost every time you look at it with data that somebody gives you, you learn something interesting about the sampling or the experiment. And this is just a histogram of the changes in the intensity -- I've converted magnitudes to fluxes, but otherwise this is like delta-M. And this is that delta-M/delta-T plot for this particular dataset, and superimposed on it, on this color scale, is a contour map of the same thing done with the Bayesian Blocks as the input data: the delta-X and delta-T are calculated not for the raw data but for the blocked data. There's nothing here, because this Bayesian Blocks representation has no variability, so that's null there; but later plots will show it. Here's the autocorrelation function, almost a delta function, indicating white noise, which is kind of what's going on here. The power spectrum and the log power spectrum are both consistent with essentially white noise, because this is a nonvariable object.
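The delta-m/delta-t construction described above is easy to sketch: take every pair of observations, record the time separation and magnitude change, and bin the pairs into a 2D density. The light curve here is synthetic and purely illustrative.

```python
# Sketch of a delta-m / delta-t density: all pairwise magnitude changes
# versus time separations for one (synthetic) light curve.
import numpy as np

def dm_dt_pairs(t, m):
    """All pairwise (delta_t, delta_m) with delta_t > 0, assuming t sorted."""
    i, j = np.triu_indices(t.size, k=1)
    return t[j] - t[i], m[j] - m[i]

rng = np.random.default_rng(8)
t = np.sort(rng.uniform(0, 300, 250))                 # ~250 epochs, uneven sampling
m = 17.0 + 0.3 * np.sin(2 * np.pi * t / 40.0) + 0.05 * rng.normal(size=t.size)
dt, dm = dm_dt_pairs(t, m)

# 2D histogram (density) of the pairs; log-spaced dt is a common choice.
H, dt_edges, dm_edges = np.histogram2d(np.log10(dt), dm, bins=(30, 30))
print(H.shape, int(H.sum()), "pairs binned")
```

The same construction can be fed the block representation instead of the raw points, which is the comparison made in the contour overlay above.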
I haven't systematically picked out the best case or the worst case; I've just kind of randomly picked a few cases to show. Now, here's a case where there's real variability, sort of an increase over the time of the observations, and the block representation follows it in a step-function way -- there's nothing other than a step-function approximation to the data, maybe showing some shorter-time-scale structure here. And here, now you see that there is enough block information to do this delta-X/delta-T plot as a contour, and you see it's somewhat different from the raw data. My claim is that the block representation has the noise kind of eliminated, so it's more the truth of what's going on in this plot. And here you see kind of the expected autocorrelation; there's some structure, perhaps. I don't really think there's anything periodic in this data, but there are some fluctuations. And here's just another case, where the Bayesian Blocks delta-M/delta-T plot more or less agrees with the one from the raw data, and I think that's just because the signal-to-noise is pretty high, so the blocks pretty faithfully follow the data.

So, of course, this is in a way kind of crazy, because I've set up a visualization thing for one object, and the whole problem here is that there are too many objects to be able to do this for individual objects. What you really want to do is extract information of the kind that Ciro and other people have talked about and input it to classification schemes. And so I've done that. I've extracted a few things -- for example, the slope of the power spectrum and some other quantities -- and so you can play games with studying these for the ensemble of, I think, several hundred objects. Only the variable ones are included here; the ones that are completely nonvariable I've eliminated. And you can play around with different things that you want to look at. This is calculating the kind of local time derivative, just the local delta-X/delta-T between adjacent points in time, for the values where X is decreasing, and here's the same quantity where X is increasing. So it's kind of obvious you should have a negative correlation here; this is almost a trivial plot. Here, this is the variance in X in the increasing part of the light curve -- sorry, decreasing -- and this is the variance in the increasing part. And I think the asymmetry in this scatter plot is saying something about the underlying physical process.

And so I'm done. Well, just a couple of quick things. This is a histogram of the power-law slopes of the power spectra, and I think -- okay, one last thing. This is completely different data. This is some solar data, chromospheric activity over the last three solar cycles. You see the solar cycle, but there's a lot of detail. If you do a time-frequency analysis, almost magically -- these are peaks in the time-frequency distribution, frequency this way in cycles per day, the date this way -- you see differential rotation, the fact that the active regions on the sun drift in latitude just like sunspots do. If you look at an overall power spectrum of that data, you never see this, but the time-frequency analysis is very sensitive to spectral features that are varying with time. So I think this is a really neat thing to have. Well, I think I'll end there.

>>: Thanks very much. Can our next speaker come up while we take a few questions?
>>: Your Bayesian Blocks was originally written for events data, like gamma rays. Can it be used for, let's say, optical photometric data with [indiscernible]?

>> Jeff Scargle: Yes. The newer paper has that explicitly, has all the details for that: the bin counts, events, and the kind of data you were talking about.

>>: Real numbers?

>> Jeff Scargle: Yeah, right.

>>: How did you do the fitness function for the dynamic programming? Is there a natural way of choosing the fitness function?

>> Jeff Scargle: It's kind of up to you. We worked out several. Basically, you want something that expresses how well the data are -- how constant the data are over an interval. So just the simple variance of the data is not too bad, but better is a maximum likelihood or posterior probability for the Poisson model in the case of event data. So there's some choice there. What you have to have is additivity: the fitness for a whole interval has to be the sum of the fitnesses of the individual blocks.

>>: The results would change, right, if you change that?

>> Jeff Scargle: Yeah, a little bit. I don't think they change much if you have a reasonable quantity measuring constancy over the interval.

>> Ani Thakar: So this is a work in progress. Since Alex already introduced the SDSS SkyServer yesterday, I don't need to spend any time on that, and Curtis also spoke about it this morning. Okay, so this is meant to be a pun, and George didn't like it, so he put me in the wrong session, just before coffee. Anyway, this is ten years of collected log data, both web hits and SQL query data, for the SkyServer. This is an effort we're just beginning, like I said; actually, the SQL part of it that I'll talk about here, we're just starting that. Jordan Raddick has been working on the ten-year sequel to the first traffic report, which I'll describe in a minute, and he's been doing most of the heavy lifting here. And, of course, Jim is the person who got us started on this whole adventure, and he insisted that we keep everything, log everything from day one. So this is really still his show.

The first traffic report, which a student working with Jim put together, covered the first five years of the SkyServer, 2001 to 2006. Jordan has made a short URL for it in case anybody wants to look it up. I cannot do it justice here; it's really got a lot of rich data in it, even for the first five years, and if you're interested, you should take a look. But just some very partial highlights: the web and SQL traffic during this period was doubling every year. That trend, of course, has not kept up; things have flattened out considerably. Hundreds of astronomers during this time basically graduated from using either the canned form queries or the sample queries on to using more free-form SQL. We also noted a flurry of activity after each release; this is basically showing DR1 through DR4 and the spikes in activity.

One of the difficulties with this first report was how to separate the traffic from programs -- what we call bots -- from actual human users, and this is kind of our best attempt to do that. So there are the total web hits, the SQL queries, and what we call web sessions. A web session ends when the interval between subsequent page hits is more than half an hour; within that window, we call it a single session. So for mortals, most of the activity is here, in the web hits.
Then SQL queries: percentage-wise, it's a very large percentage -- actually, I think this figure might be wrong; these three should add up to the total here. So anyway, spiders are programs that crawl the website in order to create an index, to be differentiated from bots, which are programs that hit the SQL database in a very heavy way. And if you want to look at the percentages: for the web hits, the mortals are the main contributor, then bots are the second largest, and then there are these spiders that are basically trying to create an index -- they essentially go to the robots.txt file. Then for the SQL queries, you can see that the bots really are the lion's share, then the mortals are here, and then very few from the spiders. And in terms of the sessions, it's kind of equally divided between the spiders and the mortals, and then very little from the bots. >>: So sessions, you mean they go there once and do something and they go away, or is it how long they've spent when they go? >> Ani Thakar: It's basically when they have successive page views and the page views are within half an hour of each other. So the maximum time between page views is defined as half an hour. >>: So the bots [indiscernible]. >> Ani Thakar: So the bots are basically programs -- for example, people trying to download the whole data of a new release, some of these data centers, or even people who have scripts wanting to do something similar. So Alex showed this yesterday, and this is basically the ten-year traffic at a glance. These are the web hits, and this, the dark trace, is the SQL traffic. One thing he didn't mention is these two big spikes: this one here in the web hits was the Galaxy Zoo release in July 2007, which, as he mentioned, we had to add servers very quickly in order to handle. This was like 45 million hits in a month, but most of that was over a period of a week. And this is actually unreal, this 39 million queries in October 2008. Actually, most of that occurred over just one day, and I'll get to that in a minute. This is the sequel to the five-year report that we're preparing, and we plan to extend the original analysis to the larger dataset. But in terms of the web hits and SQL queries, which were both in that first report, we want to separate those into different papers now, so we can do more justice to the SQL query analysis, and that's basically what I'm going to be talking about here. So this is quite a unique dataset. I'm not aware of any other dataset covering such a large time interval of SQL query usage, and I think it's a very good way to try to determine how data-intensive science is done. So the questions that we really want to ask, starting with who's actually using the SkyServer and CasJobs SQL tools that we've made available, how often they're using them, and how they're using them. Are they getting better at it with time? How complex are the queries, is the complexity increasing, and how are users using SQL? A very important aspect of that is what type of science is being done; this is going to be a more difficult thing to determine. And is the system meeting the requirements, how can we improve it, and how effective is the online help that we've provided? So just a few aggregate numbers. The total number of SQL queries to date is 194 million. Out of those, 68 million are unique queries.
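[One plausible way to express the spider/bot/mortal triage described above, purely as an illustration; the thresholds and argument names are invented, and this is not how the SkyServer logs were actually classified:]

    def classify_client(requested_robots_txt, peak_sql_queries_per_day):
        """Rough triage of a log client into spider, bot, or mortal."""
        if requested_robots_txt:
            # Indexing crawlers typically fetch robots.txt before crawling.
            return "spider"
        if peak_sql_queries_per_day > 10_000:
            # Programs that hit the SQL database in a very heavy way.
            return "bot"
        # Everything else is treated as a human user.
        return "mortal"

    print(classify_client(False, 500_000))  # bot
    print(classify_client(False, 40))       # mortal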
The unique count comes from a select distinct on the actual query text. Of the total number of queries, 145 million actually succeeded, meaning the error return was zero, and the rest failed. So in terms of users, the top five SQL users are all bots or programs -- no surprise there, because they're in the millions, and no human can do that many queries, even over ten years. So the top prize goes to the University of Victoria, the CDC. And, in fact, most of this was on a single day, October 23rd, 2008. This was actually just before the DR7 release, so the data was online but wasn't publicly announced. And most of these queries, also no surprise, did not succeed, because they were coming in just too fast. About 360K did succeed over that one day. >>: When you say [indiscernible]. >> Ani Thakar: No, I mean, they were just shooting the queries too fast. There wasn't time to get anything back; I think there was not enough interval between successive queries. >>: Do you have a policy on just, like, how much you can [indiscernible]. >> Ani Thakar: Yes, actually, the SkyServer has a throttle of 60 queries per minute, and interestingly enough, we ran into a problem where a teacher was using the SkyServer in one of their class exercises and all the students were hitting the buttons at the same time, so we had to make an exception for them. But actually, it was Jim who put in that throttle. I remember once he sent a stern email to a person who was submitting a lot of queries. So, starting to look at the distribution of users -- and I think this is along the lines of what you were asking, Curt, but I think not in enough detail for what you want -- these are the organizations that we've detected from the IP information that we have. This is the kind of catch-all other category, but these are the universities, the non-university colleges, K through 12, other information service providers, the national government institutes, and then other government, regional. That's basically dividing up the kinds of institutions the queries are coming from. In terms of the web hits, this is the kind of distribution; again, the color coding is the same. Other is the main category again, but quite a few queries come from university sources and government -- quite a lot of this is from Berkeley Lab, et cetera. And then for the SQL queries, university wins by quite a large factor, and then there is the unclassified category, but everything else falls into a fairly small portion of that. So what we're trying to do, in terms of figuring out what kind of complexity we are seeing in the SQL queries: a very naive way would be to look at the length of the query, but this doesn't tell us too much. Sometimes it does, but there are quite a few types of long queries that you could write that aren't very sophisticated. The next thing we do is look at the numbers and types of joins, and at whether people are using group by and order by constructs, and then more advanced things like the cross joins and cross applies, which are more recent additions to the language, and the old cursors, which do some of what these do. Those are more sophisticated kinds of queries. And then, of course, how people are using the user-defined functions, and also which types of functions, since some functions are quite simple but there are others that do quite a lot more. This is also a good way to detect what kind of science people are doing.
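[The construct counting just listed can be turned into a crude complexity score. A rough Python sketch; the regular expressions, weights, and example query are invented for illustration and are not the index actually used for the SkyServer logs:]

    import re

    # Constructs mentioned in the talk, with arbitrary illustrative weights.
    CONSTRUCTS = {
        r"\bjoin\b": 1,
        r"\bgroup\s+by\b": 1,
        r"\border\s+by\b": 1,
        r"\bcross\s+apply\b": 2,
        r"\bcursor\b": 2,
        r"\bdbo\.\w+\s*\(": 1,   # calls to user-defined functions
    }

    def complexity_index(sql):
        """Crude complexity score based on the presence of certain SQL elements."""
        text = sql.lower()
        return sum(w * len(re.findall(p, text)) for p, w in CONSTRUCTS.items())

    q = "SELECT p.objID FROM PhotoObj p JOIN SpecObj s ON p.objID = s.bestObjID ORDER BY p.objID"
    print(complexity_index(q))  # join (1) + order by (1) = 2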
And then, of course, combinations of these things. So we defined SQL templates; basically, we divide queries into templates. First of all, we focus on the successful queries only, so these are the 69 million unique successful queries. Then we do a replace of all the numbers in the SQL with a hash sign, do a select distinct on the resulting SQL statements, and then assign each template that we get from this a template ID. This results in just under a million query templates, and from this we start to derive a crude complexity index -- again, like I said on the previous slide, based on the presence of certain SQL elements. Besides classifying queries into classes, the templates are really useful because there are only about a million of them, and querying them is a lot faster than querying the entire log database, especially when you're doing text searches. So this is how we create a template. This is SQL code we created. We wrote this regex replacement because SQL doesn't have a regular-expression replacement or search facility, so this is something we wrote in C#, and then we squeeze out all the numbers, replace them with hashes, squeeze out whitespace, comments, et cetera, and group by whatever is the result of this replacement. >> Ani Thakar: Okay. So how am I doing for time? >>: A couple of minutes. >> Ani Thakar: Okay. So as far as the SQL constructs go, this is where all the joins are -- join, outer join, the different types of joins -- group by, order by. These are the templates, and then the lighter bars are the aggregates over the entire dataset, and this is basically the query we used to get these numbers. Then the line is where I basically divide the number of queries by the number of templates. That gives you some idea of how popular, how frequently, a given template is used, and this is the kind of distribution we see. We haven't really done much in terms of analyzing this, but that's the kind of data we have right now. The length of the queries, again, is not going to tell us too much, but the vast majority of queries are under 100 bytes. This is 200, 300, 400 thousand, so queries near 1K are a small fraction of that. And most of the bot and program queries are very small. And, of course, there is actually a limit on the query length in the SkyServer, so that kind of restricts this as well. A couple of other studies have been done along similar lines. There is a thesis at Drexel which created this Java log viewer; it had a kind of feature called sky map, which showed the spatial coverage of queries, and then there was interactive exploration of the SQL logs with color-coded SQL elements and also a statistics viewer. So you could actually view the color coding for different SQL constructs here, which was very useful. Unfortunately, by the time this was finished, it wasn't hooked up to the live SkyServer log database; it was still working on a downloaded snapshot, so that really restricted how useful it was. And then Nolan Li, who developed the CasJobs/MyDB service, also did some analysis of CasJobs queries and studied how users do data-driven analysis. Two things he studied were the number of MyDB objects per query -- these are basically MyDB tables or any other kind of objects, per user, per query -- and the number of linked MyDB objects created from queries, which is a better measure.
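[The template extraction just described -- replace every number with a hash sign, squeeze out comments and whitespace, then take the distinct results -- is roughly the following, written here in Python only as an illustration; per the talk, the actual implementation was a regular-expression replacement written in C# for use alongside SQL Server:]

    import re

    def to_template(sql):
        """Reduce a SQL query to a template along the lines described in the talk."""
        s = sql.lower()
        s = re.sub(r"--[^\n]*", " ", s)               # strip line comments
        s = re.sub(r"/\*.*?\*/", " ", s, flags=re.S)  # strip block comments
        s = re.sub(r"\b\d+(\.\d+)?\b", "#", s)        # replace numbers with a hash sign
        s = re.sub(r"\s+", " ", s).strip()            # squeeze whitespace
        return s

    queries = [
        "SELECT ra, dec FROM PhotoObj WHERE r < 17.5",
        "select ra, dec from PhotoObj where r < 18.2  -- brighter cut",
    ]
    templates = {to_template(q) for q in queries}     # the 'select distinct' step
    print(len(templates))  # both queries collapse to the same template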
In that analysis, only 38% of users had only one dependency, but these users were responsible for 76% of the workflows, so they were quite active. So the next steps: try to get templates for the sample queries, to see how much the sample queries are being used; refine the complexity index and then track complexity as a function of time; try to track SQL sessions -- this is going to be difficult, and it is, of course, much more relevant for CasJobs users, because there are facilities in MyDB to actually do this kind of extended analysis; see how people are using the built-in indices and the hierarchical triangular mesh, the HTM spatial index; and also get more detailed user demographics, meaning things like what kind of users they are. This is something that Kirk was asking about -- whether users are scientists or members of the public, professional or amateur astronomers, et cetera. That's it. Thank you. This is the up-to-the-hour traffic site for the SkyServer, if anybody wants to take a look. >>: So we do have time for one question, which I might start with. Did you ever consider, way back when, putting in place a registration system, and what effect do you think that would have had? Obviously, it probably would have reduced the number of people using it, but maybe it would have also reduced the number of spurious queries and errors. >> Ani Thakar: We have log-in for CasJobs, so you have to have an account to be a CasJobs user, with which you get your own database and can then run queries against that database. But for the SkyServer, no, it's just a browser-based thing and it's public. We don't really require any kind of registration for that. >>: Do you think it may have reduced spurious traffic or -- >> Ani Thakar: I don't think that's really a big problem. When you say spurious traffic, what exactly do you mean? >>: Well, there were a lot of, I guess, people just hammering the system. >> Ani Thakar: Those are actually legitimate users; it's just that people are trying to download the data. That may not be the best way to do it, but that's not necessarily illegal. We want people to do that, just maybe not -- >>: Slightly slower. >> Ani Thakar: Yeah. >>: So [indiscernible] is SQL still the way to go if you're designing [indiscernible], is this framework still scalable for the larger stuff? >> Ani Thakar: Well, so here we're kind of cheating, right. It's not just SQL; it's a unique set of functions and procedures we're using, plus we've built in this spatial index. So there are all these kinds of add-ons you need to make SQL work for this kind of data. And that's like one or two orders of magnitude more, at least. So I don't know whether databases will scale to that, or maybe you'll need SciDB or something like that. I don't know. >>: All right. We have a coffee break now. Can we be back in 20 minutes, after an announcement?