Data-Mining Massive Time Series Astronomical Data Sets - a Case Study

Michael K. Ng (1), Zhexue Huang (2), Markus Hegland (3)

(1) Department of Mathematics, The University of Hong Kong, Pokfulam Road, H.K.
(2) CRC for Advanced Computational Systems, Computer Sciences Laboratory, The Australian National University, Canberra, ACT 0200, Australia
(3) CRC for Advanced Computational Systems, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia

In this paper we present a new application of data mining techniques to massive astronomical data sets. Previous data mining applications deal with time-independent, multi-spectral astronomical data [2]. We are concerned with time series astronomical data. More precisely, our data consists of N time series, each with a duration of M days. The data set can be viewed as an N × M matrix in which the rows are different time series and the ordered columns correspond to consecutive time points at which measurements were made. Our primary objective in mining such data is to classify the time series according to their morphology and to identify classes which may carry some special morphological signature.

The real data in this application consists of 40 million time series, each representing the brightness sequence (light curve) of a star measured in one of two spectral bands on a daily basis in the MACHO project [1]. In total, about 20 million stars (i.e., N ≈ 2 × 10^7) have been measured over the past four years, yielding more than half a terabyte of time series data. The scientific discovery tasks in our data mining exercise are to discover new variable stars and to identify microlensing events represented in some star light curves.

There are a number of big challenges in mining such a large time series database. The size of the data is an obvious and ever-present concern. The time dimension is another challenge, one which does not arise in the other astronomical data mining applications [2].
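As a rough illustration of the size challenge, the sketch below treats the data as an N × M array with one row per light curve and counts the measurements involved. The value of about 1500 points per series is the dimensionality quoted in this paper; the 8 bytes per measurement, the NaN convention for missing values and all function names are assumptions made for illustration, since the storage format of the MACHO data is not described here.

```python
import numpy as np

def measurement_count(n_stars: int, n_bands: int, n_days: int) -> int:
    """Total number of matrix entries, with one row per star and band."""
    return n_stars * n_bands * n_days

def raw_terabytes(n_values: int, bytes_per_value: int = 8) -> float:
    """Raw size in decimal terabytes; bytes_per_value is an assumption."""
    return n_values * bytes_per_value / 1e12

# About 20 million stars in two bands gives 40 million series of ~1500 points.
total = measurement_count(n_stars=20_000_000, n_bands=2, n_days=1500)
size_tb = raw_terabytes(total)

# Toy version of the matrix view: 4 series (rows) by 6 days (columns),
# with NaN marking missing or invalid measurements (assumed convention).
light_curves = np.full((4, 6), 18.0)
light_curves[1, 2] = np.nan
light_curves[3, 4:] = np.nan

# Fraction of valid measurements in each series (row).
valid_fraction = np.mean(~np.isnan(light_curves), axis=1)
```

Under these assumptions the raw values alone are of the same order as the data volume reported above.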
The high dimensionality (i.e., M ≈ 1500) and the varying phase factor of the time series make it hard to mine the data in the time domain. A potentially costly transformation of a large number of time series to some feature domain is therefore inevitable. Selecting features that represent these time series both concisely and accurately is one of the research problems. The data is also far from error-free: noise, invalid measurements and missing values exist in every time series, and the burden of preprocessing and massaging the data is tremendous. All these problems have to be solved with different techniques before classification is performed.

The classification is based on the dissimilarity of the star light curves. We use a distance measure between two stars defined in the frequency domain. After the star light curves are converted to vectors in the feature space, we run the k-means algorithm hierarchically over a set of star feature vectors.

We use some visual techniques to verify our clusters. To verify the similarity of stars within clusters and the dissimilarity of stars among clusters in the feature space, we display the FFT coefficients of all stars in different clusters. The graph on the left side of Figure 1 shows three clusters produced from the FFT coefficients. Each cluster is represented as a smooth surface, which indicates that the stars within each cluster are similar. One can also see that the three surfaces are quite distinguishable, which means that the stars in different clusters are quite dissimilar. The graph on the right side of Figure 1 shows a group of stars which were randomly selected from the variable stars. Because the stars in this group are not clustered according to the dissimilarity measure, the surface of this group is very rough, which implies that the stars in this group belong to different clusters.
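The feature extraction and hierarchical clustering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the choice of FFT magnitudes as features, the number of coefficients, the Euclidean distance, the farthest-first initialisation, the recursive splitting scheme and all function names are assumptions, and the light curves are synthetic sinusoids with random phases.

```python
import numpy as np

def fft_features(series: np.ndarray, n_coeffs: int = 8) -> np.ndarray:
    """Map each row (one light curve) to the magnitudes of its first
    n_coeffs non-constant FFT coefficients."""
    centered = series - series.mean(axis=1, keepdims=True)
    spectrum = np.fft.rfft(centered, axis=1)
    # Magnitudes are unchanged by circular shifts in time (phase changes).
    return np.abs(spectrum[:, 1:n_coeffs + 1])

def kmeans(x: np.ndarray, k: int, n_iter: int = 50) -> np.ndarray:
    """Plain Lloyd iterations with Euclidean distance; one label per row."""
    # Farthest-first initialisation: deterministic, adequate for a sketch.
    centers = [x[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[nearest.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def hierarchical_kmeans(x, k=2, min_size=4, depth=2, prefix=()):
    """Recursively split clusters; returns a cluster path (tuple) per row."""
    n = len(x)
    if depth == 0 or n < max(min_size, k):
        return [prefix] * n
    labels = kmeans(x, k)
    paths = [None] * n
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        sub_paths = hierarchical_kmeans(x[idx], k, min_size, depth - 1, prefix + (j,))
        for i, path in zip(idx, sub_paths):
            paths[i] = path
    return paths

# Synthetic data: two groups of noisy sinusoids with random phases.
rng = np.random.default_rng(0)
t = np.arange(64)

def group(cycles: int, n: int = 10) -> np.ndarray:
    phases = rng.uniform(0, 2 * np.pi, size=(n, 1))
    return np.sin(2 * np.pi * cycles * t / 64 + phases) + rng.normal(0, 0.1, (n, 64))

curves = np.vstack([group(2), group(5)])
features = fft_features(curves)
paths = hierarchical_kmeans(features, k=2, depth=2)
```

In this toy setting the top-level split is expected to separate the two synthetic groups even though their phases differ, while deeper levels only split on noise. On real light curves, missing and invalid values would have to be handled before the FFT is applied.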
We have also implemented other visualization tools to show star light curves in the time domain, and found these tools very useful in verifying clustering results and identifying variable stars. Figure 2 shows some variable stars identified using our clustering method and visualization tools from 360 variable candidates, which were selected from about 20000 stars according to the variable index defined by astronomers.

[Figure 1: two surface plots; axes: stars, frequency components, FFT coefficient values.]
Fig. 1. The graph on the left side shows the plots of the FFT coefficients of stars in 3 clusters. The graph on the right side shows the plots of the FFT coefficients of stars in a randomly selected group.

[Figure 2: scatter plot; axes: blue (13 to 23) and blue - red (-2 to 3).]
Fig. 2. The color magnitude diagram of 20383 stars from the MACHO database. Each dot represents a star. The crosses are the candidates of variable stars determined by the variable index. The circles are the identified variable stars.

Acknowledgments: The authors wish to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program. The authors thank Dr T. Axelrod at MSSSO for providing us with the sample data.

References
1. T. S. Axelrod et al., "Statistical Issues in the Macho Project." Mt Stromlo and Siding Spring Observatories, The Australian National University, 1996.
2. U. M. Fayyad, S. G. Djorgovski and N. Weir, "Automating the Analysis and Cataloguing of Sky Surveys." In Advances in Knowledge Discovery and Data Mining, (eds) U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, AAAI Press / The MIT Press, 1996, pp. 471-493.