Data-Mining Massive Time Series Astronomical
Data Sets - a Case Study
Michael K. Ng¹, Zhexue Huang², Markus Hegland³
¹ Department of Mathematics, The University of Hong Kong, Pokfulam Road, H.K.
² CRC for Advanced Computational Systems, Computer Sciences Laboratory, The
Australian National University, Canberra, ACT 0200, Australia
³ CRC for Advanced Computational Systems, CSIRO Mathematical and Information
Sciences, GPO Box 664, Canberra, ACT 2601, Australia
In this paper we present a new application of data mining techniques to
massive astronomical data sets. Previous data mining applications dealt with
time-independent multispectral astronomical data [2]. We are concerned with
time series astronomical data. More precisely, our data consists of N time series,
each with a duration of M days. The data set can be viewed as an N × M matrix
in which the rows are different time series and the ordered columns correspond to
consecutive time points at which measurements were made. Our primary objective
in mining such data is to classify the time series according to their morphology
and to identify classes that may carry a special morphological signature.
The real data in this application consists of 40 million time series, each representing a sequence of brightness measurements (a light curve) of a star, measured in one of two
spectral bands on a daily basis in the MACHO project [1]. In total, about 20
million stars (i.e., N ≈ 2 × 10^7) have been measured over the past four years.
More than half a terabyte of time series data has resulted. The scientific discovery tasks in our data mining exercise are to discover new variable stars and to
identify microlensing events represented in some star light curves.
There are a number of major challenges in mining such a large time series
database. The size of the data is an obvious one and is always a concern.
The time dimension is another challenge, one that does not arise in the other
astronomical data mining applications [2]. The high dimensionality (i.e., M ≈ 1500) and the varying phase factor of the time series make it hard to mine the
data in the time domain. A potentially costly transformation of a large number
of time series to some feature domain is therefore unavoidable. Selecting features
that represent these time series both concisely and
accurately is one of the research problems. The data is also far from error-free: noise,
invalid measurements and missing values exist in every time series, so the burden
of preprocessing and massaging the data is tremendous. All these problems have
to be solved with different techniques before classification is performed.
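The preprocessing and feature transformation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the gap-filling rule (linear interpolation), the sentinel value for invalid measurements, the function names clean_light_curve and fft_features, and the choice of 50 coefficients are all assumptions made for illustration. The sketch fills bad samples and then keeps the magnitudes of the low-order FFT coefficients, which do not change when a periodic light curve is circularly shifted in time, so the varying phase factor no longer matters.

```python
import numpy as np

def clean_light_curve(flux, invalid=None):
    """Fill missing or invalid samples by linear interpolation (an assumed rule)."""
    flux = np.asarray(flux, dtype=float)
    bad = ~np.isfinite(flux)              # NaN or inf marks a missing value
    if invalid is not None:
        bad |= (flux == invalid)          # hypothetical sentinel for invalid data
    if bad.all():
        raise ValueError("light curve has no valid measurements")
    t = np.arange(flux.size)
    cleaned = flux.copy()
    cleaned[bad] = np.interp(t[bad], t[~bad], flux[~bad])
    return cleaned

def fft_features(flux, n_coeffs=50):
    """Unit-normalized magnitudes of the first n_coeffs non-constant FFT terms."""
    x = np.asarray(flux, dtype=float)
    x = x - x.mean()                      # remove the mean brightness level
    mag = np.abs(np.fft.rfft(x))[1:n_coeffs + 1]
    norm = np.linalg.norm(mag)
    return mag / norm if norm > 0 else mag

# Synthetic example: a periodic curve with a few gaps, M = 1500 samples.
t = np.arange(1500)
curve = 18.0 + 0.5 * np.sin(2 * np.pi * t / 37.0)
gappy = curve.copy()
gappy[[10, 500, 501]] = np.nan
gappy[700] = -99.0
features = fft_features(clean_light_curve(gappy, invalid=-99.0))
```

Reducing each series from about 1500 samples to a few dozen coefficients is what makes a later clustering pass affordable; how many coefficients to keep is exactly the conciseness versus accuracy trade-off mentioned above.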
The classification is based on the dissimilarity of the star light curves, measured as a distance between two stars in the frequency domain. After the star
light curves are converted to vectors in the feature space, we run the k-means
algorithm hierarchically over a set of star feature vectors. We use visual
techniques to verify our clusters. To verify the similarity of stars within clusters
and the dissimilarity of stars among clusters in the feature space, we display
the FFT coefficients of all stars in different clusters. The graph on the left side
of Figure 1 shows three clusters produced from the FFT coefficients. Each
cluster appears as a smooth surface, which indicates that the stars within a cluster are similar. One can also see that the three surfaces are clearly distinguishable, which means that the stars in different clusters are quite dissimilar. The
graph on the right side of Figure 1 shows a group of stars that were randomly
selected from the variable stars. Because the stars in this group are not clustered according to the dissimilarity measure, the surface of this group is very rough,
which implies that the stars in this group belong to different clusters. We have
also implemented other visualization tools to show star light curves in the time
domain and found these tools very useful in verifying clustering results and
identifying variable stars. Figure 2 shows some variable stars identified using our
clustering method and visualization tools from 360 variable candidates selected
from about 20000 stars according to the variable index defined by astronomers.
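As a rough illustration of running k-means hierarchically over feature vectors, the sketch below splits a set of vectors with plain Lloyd's k-means and then recursively splits each resulting cluster. The paper does not specify the branching factor, the stopping rule, or the exact distance, so k=2, max_depth, min_size, the Euclidean distance, and the function names kmeans and hierarchical_kmeans are assumptions, not the authors' settings.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every vector to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):              # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def hierarchical_kmeans(X, k=2, max_depth=3, min_size=5, _ids=None, _depth=0):
    """Recursively split X with k-means; returns a list of index arrays (leaves)."""
    ids = np.arange(len(X)) if _ids is None else _ids
    if _depth >= max_depth or len(ids) < max(k, min_size):
        return [ids]
    labels, _ = kmeans(X[ids], k)
    parts = [ids[labels == j] for j in range(k)]
    parts = [p for p in parts if len(p)]
    if len(parts) < 2:                    # the split failed, so stop here
        return [ids]
    leaves = []
    for p in parts:
        leaves.extend(hierarchical_kmeans(X, k, max_depth, min_size, p, _depth + 1))
    return leaves

# Synthetic stand-in for star feature vectors: three groups in a 50-dim space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 50)) for c in (0.0, 1.0, 2.0)])
leaves = hierarchical_kmeans(X, k=2, max_depth=2)
```

Each leaf is an array of row indices into X, so the members of a cluster can be plotted side by side, in the spirit of the visual checks described above, to see whether their coefficient profiles vary smoothly or roughly.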
Fig. 1. The graph on the left side shows the plots of the FFT coefficients of stars in 3 clusters. The graph on the
right side shows the plots of the FFT coefficients of stars in a randomly selected group. (Axes in both graphs: stars, frequency components, FFT coefficient values.)
Fig. 2. The color magnitude diagram of 20383 stars from the MACHO database (axes: blue magnitude and blue − red color). Each dot represents a star. The
crosses are the candidates of variable stars determined by the variable index. The circles are the identified variable
stars.
Acknowledgments: The authors wish to acknowledge that this work was carried out within the
Cooperative Research Centre for Advanced Computational Systems established under the Australian
Government's Cooperative Research Centres Program. The authors thank Dr T. Axelrod at MSSSO
for providing us with the sample data.
References
1. T. S. Axelrod et al., "Statistical Issues in the Macho Project," Mt Stromlo and
Siding Spring Observatories, The Australian National University, 1996.
2. U. M. Fayyad, S. G. Djorgovski and N. Weir, "Automating the Analysis and Cataloguing of Sky Surveys," in Advances in Knowledge Discovery and Data Mining,
(eds) U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, AAAI
Press / The MIT Press, 1996, pp. 471-493.
This article was processed using the LaTeX macro package with LLNCS style