Neural Tools for Astronomical Data Mining: The AstroVirtual Collaboration

Giuseppe Longo
Department of Physics (DSF), University Federico II of Napoli & INAF-NA
longo@na.infn.it

In collaboration with:
C. Donalek, E. Puddu, S. Sessa (DSF & INAF-NA)
A. Ciaramella, G. Raiconi, A. Staiano, A. Volpicelli, R. Tagliaferri (DMI/SA)
F. Pasian, R. Smareglia & A. Zacchei (INAF-TS)

Munich, 10-14 June 2002

Some quotes…

"A major paradigm shift is now taking place in astronomy and space science. Astronomy has suddenly become an immensely data-rich field, with numerous digital sky surveys across a range of wavelengths, with many Terabytes of pixels and with billions of detected sources, often with tens of measured parameters for each object… traditional data analysis methods are inadequate to cope with this sudden increase in the data volume…"
R.J. Brunner, S.G. Djorgovski and T.A. Prince, Massive Datasets in Astronomy, astro-ph/0106

"We would all testify to the growing gap between the generation of data and our understanding of it…"
Ian H. Witten & E. Frank, Data Mining

Where may A.I. fit into astronomical work?

K.D.D. (Knowledge Discovery in Databases), built on A.I. tools (soft computing: neural networks, fuzzy sets, genetic algorithms, etc.)

• The purpose of KDD is to identify patterns and to extract new knowledge from databases whose dimension, complexity or amount of data has so far been prohibitively large for unaided human efforts… Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful…
• This is not a technology which you can apply blindly and expect to get good results: different problems yield to different techniques…
• The implementation of effective KDD tools is expensive (time, computing, need for specialists) and requires coordinated efforts between astronomers and computer scientists (even on a semantic level).

Neural networks as grey boxes

[Diagram: feed-forward network with inputs x1…x4, a hidden layer, outputs z1…zn, and a guess/feedback loop between input and output; example uses: INTERPOLATION and PATTERN RECOGNITION.]

• Input layer (n neurons)
• Hidden layers (1 or 2)
• Output layer (n' < n neurons)
• Neurons are connected via activation functions
• Different NNs are obtained from different topologies, different activation functions, etc.

Some "astronomical" examples

Pixel space:
• Object detection, deblending (segmentation)
• Data quality (quality of auxiliary & scientific frames, …)
• Time series analysis (unevenly sampled data, etc.)
• Data compression

Catalogue space:
• Search for the known (supervised clustering, …)
• Search for the unknown
• All tasks requiring pattern recognition or interpolation (classification, etc.)
• Visualization of multiparametric spaces

Supervised vs unsupervised

Supervised:
• The NN learns from a set of examples
• Requires "a priori" knowledge (i.e., training, validation & test sets)
• Very accurate & faster than traditional methods

Unsupervised:
• The NN works on the statistical properties of the data
• Does not require any "a priori" knowledge
• May be complemented by "labeled" data
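To make the contrast concrete, here is a minimal sketch in Python/NumPy (the AstroVirtual package itself is written in MATLAB and C++, so this is an illustration, not the package code): a one-hidden-layer MLP trained on labeled examples plays the supervised role, while k-means clustering of the same unlabeled features stands in for the unsupervised SOM/GTM. All data, sizes and learning rates are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "catalogue": two blobs in a 2-feature space (e.g. a magnitude and a colour).
n = 200
stars = rng.normal([0.0, 0.0], 0.5, size=(n, 2))
galaxies = rng.normal([2.0, 2.0], 0.5, size=(n, 2))
X = np.vstack([stars, galaxies])
y = np.concatenate([np.zeros(n), np.ones(n)])       # labels: 0 = star, 1 = galaxy

# --- Supervised: a one-hidden-layer MLP trained on labelled examples ---
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)   # input  -> hidden
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 0.5
for epoch in range(300):                            # plain batch gradient descent
    H = np.tanh(X @ W1 + b1)                        # hidden activations
    p = sigmoid(H @ W2 + b2).ravel()                # P(galaxy | features)
    g = (p - y)[:, None] / len(y)                   # grad of cross-entropy w.r.t. logit
    gH = (g @ W2.T) * (1 - H**2)                    # back-propagate through tanh
    W2 -= lr * H.T @ g;  b2 -= lr * g.sum(0)
    W1 -= lr * X.T @ gH; b1 -= lr * gH.sum(0)
print("supervised accuracy:", np.mean((p > 0.5) == y))

# --- Unsupervised: clustering with no labels (k-means as a stand-in for a
# --- SOM/GTM); labels are attached only afterwards, if at all. ---
centres = X[rng.choice(len(X), 2, replace=False)]
for _ in range(20):
    assign = np.argmin(((X[:, None] - centres)**2).sum(-1), axis=1)
    centres = np.array([X[assign == k].mean(0) for k in range(2)])
print("cluster sizes:", np.bincount(assign))        # structure found from statistics alone
```

The supervised branch needs the labels y up front; the unsupervised branch recovers the two populations from the data statistics alone and is only labeled a posteriori.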
Each tool has its pros and cons

• MLPs: fast, mainly supervised, easy implementation of non-linearity
• SOM: a little slower, unsupervised, non-linear, great visualization capabilities, non-physical output
• GTM: slower, unsupervised, great visualization, physical output
• PCA & ICA (linear and non-linear): poor visualization, physical output, best on correlated inputs
• Fuzzy similarities: slow on large volumes of data, suited to ill-defined problems
• Etc.

The AstroVirtual package

Code written in MATLAB & C++ (demo on this laptop).

[Flowchart: import (compliant vs non-compliant headers, with header processing), then preprocessing, then two branches. Unsupervised branch: parameter options, feature selection via unsupervised clustering, then fuzzy sets, SOM, GTM, etc. Supervised branch: parameter and training options, training-set preparation (labeled data) or label preparation (unlabeled data), feature selection via unsupervised clustering, then MLP, RBF, etc. Both branches feed the INTERPRETATION stage.]

ASTRONOMICAL APPLICATIONS
• Object extraction
• Star/galaxy classification
• Data quality from telemetry data (TNG – LTA)
• Photometric redshifts for the SDSS-EDR
• Time series analysis (Cepheids, binaries, AGN, etc.)

PARTICLE PHYSICS
• Data analysis for the VIRGO experiment (noise removal)
• Data analysis for a neutrino-oscillation (CERN/INFN) experiment (apex position and energy)
• Data analysis for the ARGO experiment (event detection and energy)

Unsupervised S/G classification

Input data: DPOSS catalogue (ca. 5x10^5 objects)
SOM (output is a U-matrix); GTM (output is a p.d.f.)
• Feature selection (backward elimination strategy)
• Compression of the input space and re-design of the network
• Classification
• Labeling (500 well-classified objects)

[Figure: star/galaxy classification, automatic selection of significant features, unsupervised SOM on DPOSS data.]

[Figure: labeling, localization of a set of 500 faint stars.]

[Figures: GTM unsupervised clustering of the S/G – CDF field (5x10^5 objects); stars p.d.f., galaxies p.d.f. and cumulative p.d.f.]

Photometric redshifts: a mixed case

• Input data set: SDSS-EDR photometric data (galaxies)
• Training/validation/test set: SDSS-EDR spectroscopic subsample

[Flowchart: SDSS-EDR DB, unsupervised SOM (completeness, reliability map), set construction, supervised feature selection (SOM), supervised MLP experiments, best MLP model.]

Step 1: feature selection (BES)
Unsupervised/labeled SOM.
Input parameters: ra, dec, fibermag, petromag, mag, petro_r50, rho, etc.
Selected features: r; u-g; g-r; r-i; i-z; r50; r90; rho

Step 2: auxiliary set construction
Unsupervised SOM to identify significant clusters in the N-dimensional input space (complete coverage of the training set); construction of training/validation and test sets representative of the input data.

Step 3: experiments to find the optimal architecture
Varying the number of inputs, of hidden neurons, of patterns in the training set, of training epochs, of Bayesian cycles and inner loops, etc. Convergence is computed on the validation set; the error is derived from the test set. Robust error: 0.02176.
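A minimal sketch of this Step 3 loop, again in Python/NumPy rather than the MATLAB/C++ of the package: a grid of architectures is scanned, convergence is judged on the validation set only, and the quoted error comes from the held-out test set. The synthetic catalogue, the grid values and the helper train_mlp are illustrative assumptions; in particular, plain gradient descent stands in for the Bayesian cycles and inner loops mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the SDSS-EDR sets: 8 columns mimic the selected features
# (r, u-g, g-r, r-i, i-z, r50, r90, rho); the target mimics spectroscopic z.
N, n_in = 3000, 8
X = rng.normal(size=(N, n_in))
z = 0.3 + 0.1 * X @ rng.normal(size=n_in) + 0.02 * rng.normal(size=N)

# Training / validation / test split, as in Step 2.
i = rng.permutation(N)
tr, va, te = i[:2000], i[2000:2500], i[2500:]

def train_mlp(Xt, zt, n_hidden, epochs, lr=0.01):
    """One-hidden-layer regression MLP, plain batch gradient descent."""
    W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 1 / np.sqrt(n_hidden), (n_hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        H = np.tanh(Xt @ W1 + b1)
        pred = (H @ W2 + b2).ravel()
        g = (pred - zt)[:, None] / len(zt)           # grad of 0.5 * MSE
        gH = (g @ W2.T) * (1 - H**2)                 # back-propagate through tanh
        W2 -= lr * H.T @ g;  b2 -= lr * g.sum(0)
        W1 -= lr * Xt.T @ gH; b1 -= lr * gH.sum(0)
    return lambda Xq: (np.tanh(Xq @ W1 + b1) @ W2 + b2).ravel()

# Scan architectures; the selection uses the validation set only.
best = (np.inf, None, None)
for n_hidden in (4, 8, 16, 32):
    for epochs in (200, 500):
        model = train_mlp(X[tr], z[tr], n_hidden, epochs)
        val_err = np.sqrt(np.mean((model(X[va]) - z[va])**2))
        if val_err < best[0]:
            best = (val_err, (n_hidden, epochs), model)

val_err, arch, model = best
test_err = np.sqrt(np.mean((model(X[te]) - z[te])**2))  # the quoted error comes from here
print(f"best architecture {arch}: validation rms {val_err:.5f}, test rms {test_err:.5f}")
```

Keeping the test set out of the selection loop is what makes the final quoted error (the 0.02176 above, in the real experiment) an honest estimate.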
Step 4: computation of confusion matrices & flagging of spurious outputs

Unsupervised SOM clustering with a posteriori labeling from the test set:
• We train the SOM and assign to each neuron a label corresponding to a class (e.g. redshift < 0.5 = class 1, redshift > 0.5 = class 2).
• We then evaluate the confusion matrix on the test set and use these statistics to evaluate the completeness of the catalogue (a worked sketch of this bookkeeping is given after the conclusions).

60 nodes      C1      C2
C1         25121     347
C2           213    2028

120 nodes     C1      C2
C1         24790     678
C2           102    2139

… + new & deeper training set → ASTROVIRTUAL CATALOGUE

Preliminary results from an application to TNG-LTA

• The TNG telemetry continuously monitors a series of parameters (pointing, tracking, actuators of the mirrors, etc.)
• Input data: 31 (apparently uncorrelated) parameters
• SOM unsupervised clustering with "a posteriori" labeling
• Quality labels from randomly chosen images obtained during the acquisition of the telemetry data

[Figures: 3-D U-matrix; similarity coloring.]

[Figure: top: good tracking; bottom: bad tracking.]

CONCLUSIONS

• KDD requires a strong interaction of domain experts with "true" computer scientists.
• The implementation of KDD tools takes a lot of time… in order to be worth the effort, they need to be as general as possible.
• They may not be "the solution", but they will certainly help in any classification, pattern recognition or interpolation problem encountered in the usage of large databases.
• On a short time scale (ca. 3-5 years) KDD will not affect present-day astronomical work that is not based on large databases, and will be confined to large projects only.
• On a longer time scale KDD will become a more widespread tool… most probably, A.I. KDD tools will be hidden behind most DB engines.
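As anticipated under Step 4, here is a small worked sketch of the completeness bookkeeping applied to the two confusion matrices quoted there. It assumes that rows list the true class and columns the assigned class (the slide does not state the convention) and, like the other sketches, it uses Python/NumPy rather than the package's MATLAB/C++.

```python
import numpy as np

# Confusion matrices from the 60-node and 120-node SOM runs of Step 4
# (assumed convention: rows = true class, columns = assigned class).
runs = {
    "60 nodes":  np.array([[25121, 347], [213, 2028]]),
    "120 nodes": np.array([[24790, 678], [102, 2139]]),
}

for name, cm in runs.items():
    completeness = np.diag(cm) / cm.sum(axis=1)       # fraction of each true class recovered
    contamination = 1 - np.diag(cm) / cm.sum(axis=0)  # wrong objects per assigned class
    print(name)
    for k in range(2):
        print(f"  class {k + 1}: completeness {completeness[k]:.3f}, "
              f"contamination {contamination[k]:.3f}")
```

Under this reading, going from 60 to 120 nodes raises the class-2 completeness from about 0.90 to 0.95, while the class-1 completeness drops only slightly (from about 0.99 to 0.97).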