Artificial Intelligence Tools for Visualisation and Data Mining

Neural Tools for Astronomical Data Mining:
The Astrovirtual collaboration
Giuseppe Longo
Department of Physics - DSF
University Federico II of Napoli & INAF-NA
longo@na.infn.it
In Collaboration with:
C. Donalek, E. Puddu, S. Sessa – DSF & INAF-NA
A. Ciaramella, G. Raiconi, A. Staiano, A. Volpicelli, R. Tagliaferri – DMI/SA
F. Pasian, R. Smareglia & A. Zacchei – INAF-TS
Munich, 10-14 June 2002
Some quotes…
"A major paradigm shift is now taking place in astronomy and space science. Astronomy has suddenly become an immensely data-rich field, with numerous digital sky surveys across a range of wavelengths, with many Terabytes of pixels and with billions of detected sources, often with tens of measured parameters for each object… traditional data analysis methods are inadequate to cope with this sudden increase in the data volume…"
R.J. Brunner, S.G. Djorgovski and T.A. Prince, Massive Datasets in Astronomy, astro-ph/0106
"We would all testify to the growing gap between the generation of data and our understanding of it…"
Ian H. Witten & E. Frank, Data Mining
Where does A.I. fit into astronomical work?
[Diagram: K.D.D. and A.I. tools (soft computing: neural networks, fuzzy sets, genetic algorithms, etc.)]
• The purpose of KDD is to identify patterns and to extract new knowledge from databases whose dimension, complexity or sheer amount of data has so far been prohibitively large for unaided human efforts… Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful…
• This is not a technology that you can apply blindly and expect to get good results. Different problems yield to different techniques…
• The implementation of effective KDD tools is expensive (time, computing, need for specialists) and requires coordinated efforts between astronomers and computer scientists (even on a semantic level).
Neural networks as grey boxes

[Diagram: feed-forward network; inputs x1…x4 enter the input layer, a hidden layer maps them to outputs z1…zn, and the output "guess" is compared with the desired answer y to produce a "feedback" signal]

• Input layer (n neurons)
• M hidden layers (1 or 2)
• Output layer (n' < n neurons)

Neurons are connected via activation functions. Different NNs are given by different topologies, different activation functions, etc. Typical uses: INTERPOLATION and PATTERN RECOGNITION (a minimal numerical sketch follows below).
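As a purely illustrative companion to the slide above, here is a minimal numpy sketch of the "grey box": a one-hidden-layer MLP whose output "guess" is corrected by an error "feedback" through gradient descent. The toy interpolation task, the network sizes and the tanh activation are assumptions for the example, not the AstroVirtual code.

```python
# Minimal sketch of the "grey box" idea: a one-hidden-layer MLP trained by
# gradient descent on a toy interpolation task (all choices illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: interpolate y = sin(x) from noisy samples.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X) + 0.05 * rng.normal(size=X.shape)

n_in, n_hid, n_out = 1, 8, 1             # input / hidden / output neurons
W1 = rng.normal(0, 0.5, (n_in, n_hid))   # input -> hidden weights
b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.5, (n_hid, n_out))  # hidden -> output weights
b2 = np.zeros(n_out)

lr = 0.05
for epoch in range(2000):
    # Forward pass: the network's "guess".
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    err = out - y                        # the "feedback" signal
    # Backward pass: propagate the error back to the weights.
    gW2 = h.T @ err / len(X)
    gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h**2)       # tanh derivative
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final MSE:", float((err**2).mean()))
```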
Some "astronomical" examples
Pixel space:
• Object detection, deblending (segmentation)
• Data quality (quality of auxiliary & scientific frames, …)
• Data compression

Catalogue space:
• Search for the known (supervised clustering)
• Search for the unknown
• Time series analysis (unevenly sampled data, etc.)
• All tasks requiring pattern recognition or interpolation (classification, etc.)
• Visualization of multiparametric spaces
Supervised vs unsupervised
Supervised:
• The NN learns from a set of examples
• Requires "a priori" knowledge (i.e. training, validation & test sets)
• Very accurate & faster than traditional methods

Unsupervised:
• The NN works on the statistical properties of the data
• Does not require any "a priori" knowledge
• May be complemented by "labeled" data

(A side-by-side sketch of the two regimes follows.)
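The following hedged sketch contrasts the two regimes with scikit-learn stand-ins: an MLP classifier for the supervised case (which needs labeled train/test sets) and k-means for the unsupervised case (which only uses the statistics of the data). The data set and every hyperparameter are illustrative assumptions.

```python
# Illustrative contrast, not the AstroVirtual code: the same feature matrix
# handled by a supervised net and by an unsupervised clustering method.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans

X, labels = make_blobs(n_samples=600, centers=2, random_state=0)

# Supervised: requires "a priori knowledge" in the form of labeled examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels at all; clusters follow the data's own structure
# and can be matched to physical classes afterwards with a few labeled objects.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```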
Each tool has its pros and cons
• MLPs: fast, mainly supervised, easy implementation of non-linearity
• SOM: a little slower, unsupervised, non-linear, great visualization capabilities, non-physical output
• GTM: slower, unsupervised, great visualization, physical output
• PCA & ICA (linear and non-linear): poor visualization, physical output, best on correlated inputs
• Fuzzy similarities: slow on large volumes of data, handle ill-defined problems
• Etc…
The AstroVirtual package

Code written in MATLAB & C++; DEMO on this laptop

[Flowchart: open/import data → header check (compliant, or non-compliant → header processing) → preprocessing → unsupervised or supervised branch:
• unsupervised: parameter options → feature selection via unsupervised clustering → SOM, GTM, Fuzzy set, etc.
• supervised: parameter and training options → training set preparation (labeled, or unlabeled → label preparation) → feature selection via unsupervised clustering → MLP, RBF, etc.
Both branches end in INTERPRETATION]
ASTRONOMICAL APPLICATIONS
• Object extraction
• Star/galaxy classification
• Data quality from telemetry data (TNG – LTA)
• Photometric redshifts for SDSS-EDR
• Time series analysis (Cepheids, binaries, AGN, etc.)
PARTICLE PHYSICS
• Data analysis of the VIRGO experiment (noise removal)
• Data analysis of the neutrino-oscillation (CERN/INFN) experiment (apex position and energy)
• Data analysis of the ARGO experiment (event detection and energy)
Unsupervised S/G classification
Input data: DPOSS catalogue (ca. 5×10⁵ objects)
SOM (output is a U-Matrix); GTM (output is a PDF)
• Feature selection (backward elimination strategy)
• Compression of input space and re-design of the network
• Classification
• Labeling (500 well-classified objects)
(A minimal SOM sketch follows below.)
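Since the SOM is the workhorse of this step, a minimal numpy sketch may help: it trains a small 2-D map and computes the U-Matrix, whose high-distance ridges separate clusters. The map size, the decay schedules and the toy two-cluster data are assumptions; the actual package is a full MATLAB/C++ implementation.

```python
# Minimal numpy SOM sketch (illustrative only). Trains a 2-D map on a
# catalogue-like feature matrix; the U-Matrix then shows cluster borders
# as ridges of large inter-neuron distance.
import numpy as np

def train_som(X, rows=10, cols=10, epochs=10, lr0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(rows, cols, d))          # codebook vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1).astype(float)
    sigma0 = max(rows, cols) / 2.0
    t, t_max = 0, epochs * n
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i]
            # Best-matching unit: the neuron closest to this object.
            dist = ((W - x) ** 2).sum(axis=-1)
            bmu = np.unravel_index(dist.argmin(), dist.shape)
            # Linearly shrinking neighbourhood and learning rate.
            frac = t / t_max
            sigma = sigma0 * (1 - frac) + 1e-3
            lr = lr0 * (1 - frac) + 1e-3
            h = np.exp(-((grid - np.array(bmu)) ** 2).sum(-1) / (2 * sigma**2))
            W += lr * h[..., None] * (x - W)
            t += 1
    return W

def u_matrix(W):
    # Mean distance of each codebook vector to its grid neighbours.
    rows, cols, _ = W.shape
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            nb = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= r + dr < rows and 0 <= c + dc < cols]
            U[r, c] = np.mean([np.linalg.norm(W[r, c] - W[i, j]) for i, j in nb])
    return U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.normal(4, 1, (300, 4))])
U = u_matrix(train_som(X))
print("U-Matrix ridge contrast (max/min):", U.max() / U.min())
```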
Star/Galaxy classification

[Figure: automatic selection of significant features; unsupervised SOM on DPOSS data]
Labeling

[Figure: localization of a set of 500 faint stars]
[Figure: G.T.M. unsupervised clustering; stars p.d.f., galaxies p.d.f., cumulative p.d.f.; S/G CDF, field]
[Figure: G.T.M. unsupervised clustering on 5×10⁵ objects; stars p.d.f., galaxies p.d.f., cumulative p.d.f.; S/G CDF, field]
Photometric redshifts: a mixed case
[Flowchart: SDSS-EDR DB → feature selection (supervised SOM) → set construction (unsupervised SOM) → experiments (supervised MLP) → best MLP model; a second unsupervised SOM produces the completeness/reliability map]

• Input data set: SDSS-EDR photometric data (galaxies)
• Training/validation/test sets: SDSS-EDR spectroscopic subsample
Step 1: feature selection (BES, backward elimination strategy)
Unsupervised/labeled SOM
Input parameters: ra, dec, fibermag, petromag, mag, petro_r50, rho, etc.
Selected features: r; u-g; g-r; r-i; i-z; r50; r90; rho
(A sketch of the elimination loop follows.)
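The following is a hedged sketch of a backward elimination strategy: starting from all features, repeatedly drop the one whose removal least degrades a cross-validated score, stopping when every removal hurts. The MLP-based scoring function is an illustrative stand-in for the SOM-based criterion of the talk, and the toy regression data are assumptions.

```python
# Sketch of backward elimination (BES) for feature selection; the scoring
# model is an assumed stand-in, not the talk's SOM criterion.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=400, n_features=8, n_informative=4,
                       noise=5.0, random_state=0)
features = list(range(X.shape[1]))

def score(cols):
    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
    return cross_val_score(model, X[:, cols], y, cv=3).mean()

current = score(features)
while len(features) > 1:
    # Try removing each remaining feature in turn.
    trials = [(score([f for f in features if f != cand]), cand)
              for cand in features]
    best_score, worst_feature = max(trials)
    if best_score < current:      # every removal hurts: stop
        break
    features.remove(worst_feature)
    current = best_score

print("selected features:", features, "score:", round(current, 3))
```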
Step 2: auxiliary set construction
Unsupervised SOM to identify significant clusters in the N-dimensional input space (complete coverage of the training set)
Construction of training/validation and test sets representative of the input data
(A sketch follows.)
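An illustrative sketch of the set-construction idea: cluster the input space, then sample every cluster so that the training, validation and test sets cover the whole input distribution. KMeans stands in for the SOM, and the 60/20/20 split fractions are assumed choices.

```python
# Sketch: representative train/validation/test sets drawn from every
# cluster of the input space (KMeans as a stand-in for the SOM).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # toy photometric features

clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

train_idx, valid_idx, test_idx = [], [], []
for c in np.unique(clusters):
    idx = rng.permutation(np.where(clusters == c)[0])
    n = len(idx)
    train_idx += list(idx[: int(0.6 * n)])        # 60% of each cluster
    valid_idx += list(idx[int(0.6 * n): int(0.8 * n)])
    test_idx  += list(idx[int(0.8 * n):])

print(len(train_idx), len(valid_idx), len(test_idx))
```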
Step 3: experiments to find the optimal architecture
Varying the number of inputs, of hidden neurons, of patterns in the training set, of training epochs, of Bayesian cycles and inner loops, etc.
Convergence is computed on the validation set; the error is derived from the test set.
Robust error: 0.02176
(An illustrative scan is sketched below.)
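A hedged sketch of such an architecture scan: candidate networks are compared on the validation set, and the quoted error is then measured on the held-out test set. The hyperparameter grid, the toy data and the robust-error definition (median absolute residual) are assumptions, not the talk's exact protocol.

```python
# Sketch of the architecture experiments: scan hidden-layer sizes, judge
# convergence on a validation set, quote the error on a held-out test set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1200, n_features=8, noise=3.0, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25,
                                                  random_state=0)

best = None
for n_hidden in (4, 8, 16, 32):                 # assumed grid of experiments
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                       random_state=0)
    net.fit(X_train, y_train)
    val_err = np.mean((net.predict(X_val) - y_val) ** 2)
    if best is None or val_err < best[0]:
        best = (val_err, n_hidden, net)

_, n_hidden, net = best
robust = np.median(np.abs(net.predict(X_test) - y_test))  # assumed definition
print(f"best architecture: {n_hidden} hidden neurons, robust error {robust:.4f}")
```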
Step 4: computation of confusion matrices & flagging of spurious outputs
Unsupervised SOM clustering with a posteriori labeling from the test set:
• We train the SOM and assign to each neuron a label corresponding to the class (e.g. redshift < 0.5 = Class 1, redshift > 0.5 = Class 2)
• Then we evaluate the confusion matrix on the test set and use these statistics to evaluate the completeness of the catalog (completeness computation sketched after the matrices below)
60 nodes:
        C1      C2
C1   25121     347
C2     213    2028

120 nodes:
        C1      C2
C1   24790     678
C2     102    2139
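For illustration, completeness and contamination can be read off such a matrix as sketched below. The numbers are the 120-node matrix above; the convention that rows hold the true class and columns the assigned class is an assumption about the slide's layout.

```python
# Sketch: completeness and contamination from a 2x2 confusion matrix
# (rows = true class, columns = assigned class; assumed convention).
import numpy as np

cm = np.array([[24790, 678],
               [  102, 2139]])

for k, name in enumerate(("C1 (z < 0.5)", "C2 (z > 0.5)")):
    completeness = cm[k, k] / cm[k].sum()          # true members recovered
    contamination = 1 - cm[k, k] / cm[:, k].sum()  # wrong objects among assigned
    print(f"{name}: completeness {completeness:.3f}, "
          f"contamination {contamination:.3f}")
```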
[Diagram: … + new & deeper training set → ASTROVIRTUAL CATALOGUE]
Preliminary results from an application to TNG-LTA
• TNG telemetry continuously monitors a series of parameters (pointing, tracking, actuators of mirrors, etc.)
• Input data: 31 parameters (apparently uncorrelated)
• SOM unsupervised clustering with "a posteriori" labeling (a labeling sketch follows)
• Quality labels from randomly chosen images obtained during the acquisition of the telemetric data
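A minimal sketch of the "a posteriori" labeling step: each SOM node inherits the majority quality label of the labeled images that map onto it, and new telemetry vectors are then flagged by the label of their node. The best-matching-unit assignments here are randomly faked for brevity; a real run would take them from the trained map.

```python
# Sketch of a posteriori labeling of SOM nodes by majority vote
# (node assignments faked with random BMU indices for brevity).
import numpy as np

rng = np.random.default_rng(0)
n_nodes = 60
bmu = rng.integers(0, n_nodes, size=500)      # best-matching unit per object
quality = rng.integers(0, 2, size=500)        # 0 = bad, 1 = good (image labels)

node_label = np.full(n_nodes, -1)             # -1: node never hit by labeled data
for node in range(n_nodes):
    hits = quality[bmu == node]
    if hits.size:
        node_label[node] = np.bincount(hits).argmax()

new_bmu = rng.integers(0, n_nodes, size=10)   # nodes of new, unlabeled telemetry
print("inherited quality labels:", node_label[new_bmu])
```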
[Figure: 3-D U-Matrix with similarity coloring]
[Figure: top: good tracking; below: bad tracking]
CONCLUSIONS
• KDD requires strong interaction of domain experts with "true" computer scientists
• The implementation of KDD tools takes a lot of time… in order to be worth the effort, they need to be as general as possible
• They may not be "the solution", but they will surely help in any classification, pattern recognition or interpolation problem encountered in the usage of large databases
• On a short time scale (ca. 3-5 years) KDD will not affect present-day astronomical work not based on large DBs, and will be confined to large projects only
• On a longer time scale KDD will become a more widespread tool… most probably A.I. KDD tools will be hidden behind most DB engines