Hrvatska, June 3rd, 2003

COST Action n. 283 - progress report, June 2003
Computational and Information
Infrastructures in the Astronomical
Data GRID
Chair: Prof. F. Murtagh – Queen's University Belfast
Giuseppe Longo
Chair of Astrophysics
Department of Physical Sciences
University of Napoli “Federico II”, Italy
& INFN (Italian Institute for Nuclear Physics)
longo@na.infn.it
Methodological background:
That is: does history teach us anything (or not)?
[Figure: role of technological breakthroughs. Panels: all discoveries; before 1954; after 1954.]
Where is (now) the next breakthrough in Astronomy?
Either new channels (better: new information carriers):
• Electromagnetic waves (optical since 1609, other bands since the 1960s)
• Solid samples (1970s ->)
• Gravitational waves (2005 ->)
• Neutrinos (early 1980s ->)
Or leaps in any of:
• Sensitivity
• Spectral range
• Spectral resolution
• Angular resolution
• Time resolution
The iAstro people believe that:
[Figure: curves labelled CCDs and Glass, years 1970 to 2000, logarithmic axis from 0.1 to 1000. Stages marked: massive data sets, distributed computing, massive data mining.]
Hardware breakthrough: wide-field imaging with CCD mosaics enables digital surveys -> discoveries
The sky covers ca. 40,000 sq. deg.
With 0.6 arcsec sampling: ca. $2 \times 10^{12}$ pixels
8 TB per band (10 to 100 TB per survey)
Ca. 10 PB when temporal resolution is kept
($\Delta t$ ca. 24 h for 1 yr … need for 20 yr)
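The estimate above can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope calculation: the sky area and sampling are the slide's figures, while the 4 bytes per pixel and the function names are assumptions introduced here for illustration.

```python
# Back-of-envelope survey data volume, following the figures on the slide.
SKY_AREA_SQ_DEG = 40_000     # rounded sky area quoted on the slide
PIXEL_SCALE_ARCSEC = 0.6     # sampling quoted on the slide
BYTES_PER_PIXEL = 4          # ASSUMPTION: not stated in the source


def pixels_on_sky(area_sq_deg: float, pixel_scale_arcsec: float) -> float:
    """Number of pixels needed to tile the given area at the given sampling."""
    pixels_per_degree = 3600.0 / pixel_scale_arcsec
    return area_sq_deg * pixels_per_degree ** 2


def volume_tb(n_pixels: float, bytes_per_pixel: int, n_epochs: int = 1) -> float:
    """Raw data volume in terabytes (1 TB = 1e12 bytes) for one band."""
    return n_pixels * bytes_per_pixel * n_epochs / 1e12


n_pix = pixels_on_sky(SKY_AREA_SQ_DEG, PIXEL_SCALE_ARCSEC)
single_epoch_tb = volume_tb(n_pix, BYTES_PER_PIXEL)
# One image per night for one year (delta t of about 24 h).
one_year_pb = volume_tb(n_pix, BYTES_PER_PIXEL, n_epochs=365) / 1000.0

print(f"pixels: {n_pix:.2e}")
print(f"one band, one epoch: {single_epoch_tb:.1f} TB")
print(f"one band, nightly for 1 yr: {one_year_pb:.1f} PB")
```

With these inputs the pixel count and the per-band volume come out slightly below the slide's rounded values but of the same order, and a handful of bands observed nightly for a year reaches the quoted petabyte scale.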
From Traditional to Survey Science
Traditional: Telescope -> Data Analysis -> Results
Survey-Based: Survey Telescope -> Archive -> Data Mining -> Target Selection -> Follow-Up Telescope -> Results
Another Survey/Archive?
Survey science is highly successful and increasingly prominent, but inherently limited by
the information content of individual surveys.
What comes next, beyond survey science, is distributed (V.O.) science.
Courtesy of G. Djorgovski
A schematic illustration of the new astronomy
[Diagram labels:]
• Primary data providers: surveys, observatories, missions
• Survey and mission archives
• VO: data services; data mining and analysis; target selection
• Secondary data providers
• Follow-up telescopes and missions
• Results
• Digital libraries
Courtesy of G. Djorgovski
Panchromatic view of the Universe:
[Image panels: Radio; Visible + X-ray; Far-Infrared; Visible; Dust Map; Density Map]
New domains of the parameter space: cf. time
Search for the unknown
Offers:
Different physics
Global understanding
Comparison with theory
New discoveries
Faint, Fast Transients (Tyson et al.)
Pixel space (raw data; TB/PB)
Huge data flow, data fusion, need for recalibrations
Calls for:
• Automatic catalogue extraction
• Spurious feature removal
• Image parametrization and classification
• Data compression
• Multiscale analysis, etc.
Catalogue space (features; TB)
[Figure: axes of the parameter space: flux, wavelength, time, RA, Dec, proper motion, polarization, morphology / surface brightness, non-EM …]
High dimensionality (N >> 100)
What is the coverage? Where are the gaps?
Calls for:
• Feature selection
• Clustering
• Statistics
• KDD
• Visualization, etc.
Sounds beautiful! … BUT:
[Figure: hours of computer time per night, logarithmic axis from 0.0000001 to 1000, versus year, 1500 to 2000. Doubling time T2 (Moore) ~ 1.5 years.]
Terascale (petascale?) computing and/or better algorithms
are required
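The doubling time quoted above, T2 ~ 1.5 years, fixes how fast hardware alone can close a given gap. The following is a minimal sketch of that arithmetic; the function names are hypothetical and a clean exponential law is assumed.

```python
import math

MOORE_DOUBLING_YEARS = 1.5   # T2 as quoted on the slide


def growth_factor(years: float, doubling_time: float = MOORE_DOUBLING_YEARS) -> float:
    """Factor gained in computing power after `years` of steady doubling."""
    return 2.0 ** (years / doubling_time)


def years_to_gain(factor: float, doubling_time: float = MOORE_DOUBLING_YEARS) -> float:
    """Years of steady doubling needed to gain the given factor."""
    return doubling_time * math.log2(factor)


print(f"gain over 15 years: x{growth_factor(15):.0f}")
print(f"years for a 1000-fold gain: {years_to_gain(1000):.1f}")
```

Under this assumption a thousand-fold gain takes roughly fifteen years of waiting, which is one way to read the slide's call for better algorithms alongside larger machines.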
Digital sky surveys call for huge increases in
computing power
In modern data sets: $D_D \gg 10$, $D_S \gg 3$
Data Complexity -> Multidimensionality -> Discoveries
But the bad news is …
The computational cost of clustering analysis:
K-means: $K \times N \times I \times D$
Expectation Maximisation: $K \times N \times I \times D^2$
Monte Carlo Cross-Validation: $M \times K_{\max}^2 \times N \times I \times D^2$
$N$ = no. of data vectors, $D$ = no. of data dimensions
$K$ = no. of clusters chosen, $K_{\max}$ = max no. of clusters tried
$I$ = no. of iterations, $M$ = no. of Monte Carlo trials/partitions
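To get a feeling for these scalings, the three cost expressions can be evaluated for a catalogue of a few million objects with 50 features. This is only an illustrative sketch: the values chosen for K, I, M and Kmax are assumptions made here, and the results are operation counts up to a constant factor, not run times.

```python
def kmeans_cost(K: int, N: int, I: int, D: int) -> int:
    """K-means: K x N x I x D."""
    return K * N * I * D


def em_cost(K: int, N: int, I: int, D: int) -> int:
    """Expectation Maximisation: K x N x I x D^2."""
    return K * N * I * D ** 2


def mccv_cost(M: int, K_max: int, N: int, I: int, D: int) -> int:
    """Monte Carlo Cross-Validation: M x Kmax^2 x N x I x D^2."""
    return M * K_max ** 2 * N * I * D ** 2


# Illustrative values (ASSUMPTIONS, except where noted).
N = 5_000_000   # data vectors
D = 50          # data dimensions
K = 2           # clusters chosen, e.g. two classes
I = 100         # iterations
M = 20          # Monte Carlo trials
K_MAX = 10      # maximum number of clusters tried

for name, cost in [
    ("K-means", kmeans_cost(K, N, I, D)),
    ("EM", em_cost(K, N, I, D)),
    ("MCCV", mccv_cost(M, K_MAX, N, I, D)),
]:
    print(f"{name:8s} ~ {cost:.1e} operations")
```

The extra factor of D between K-means and EM, and the further factor M x Kmax^2 / K for cross-validation, show why dimensionality reduction pays off before any clustering is attempted.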
Some dimensionality reduction methods do exist (e.g., PCA, class
prototypes, hierarchical methods, etc.), but more work is needed
“Standard Activities”
All meeting reports and proceedings are on the web
• First and Second MC meetings, Brussels,
11/23/2001 & 2/14-15/2002
• Third MC meeting, Edinburgh, 07/21/2002
(at GGF-5, Global Grid Forum 5)
• Fourth MC meeting & workshop on Multispectral
data analysis, and image metadata,
Strasbourg, 11/28-29/2002
• Fifth MC meeting & workshop on High/low
resolution signal processing, Granada,
02/22-23/2003
• Planned: Sixth MC meeting & workshop on
Poisson noise models, Nice, Oct. 2003.
• Planned: Seventh MC meeting & workshop on
Data mining & Image analysis in a
distributed environment, Capri, Mar. 2004.
Guess who was taking the picture…
Granada, February 2002
Major Orientation of iAstro in early 2003: FP6
• Expressions of Interest filed in summer 2002.
• Participation in Commission Information Days.
• Involvement in several NoEs (sensor fusion,
information retrieval, e-education and training,
the European virtual observatory, and digital
signal processing and data mining in medicine).
• Participation in evaluation panels.
• COST 283 proposal for the Marie Curie RTN network
“GridFocus: Data and Information Fusion and Mining in
the Context of the DataGrid”
• Submitted early April 2003.
• Participants: iAstro partners in BG, CH, D, E, F, GR, H, I, IRL and UK.
Additional partner cluster in University of Paris Sud.
• Multiband and multiple layer image and
signal processing as a basic paradigm for the
data Grid.
• Data mining of visual and other streams,
including high performance forensic image
data mining.
• Empirical and virtual data interfaces.
GridFocus concept based on data dynamics and information thermodynamics
SOMETHING ON
SCIENCE….
[Diagram: software workflow. Recoverable labels:]
• Code in C++, parallelized on Beowulf
• Used (so far) for: cosmology, particle physics (ARGO), gravitational waves (VIRGO)
• Input: open / import; compliant header / non-compliant header; Head/proc.
• Preprocessing
• Supervised / unsupervised; parameter and training options; labeled / unlabeled data; training set preparation; label preparation
• Feature selection via unsupervised clustering
• Unsupervised methods: fuzzy set, SOM, GTM, etc.
• Supervised methods: MLP, RBF, etc.
• INTERPRETATION
A standard clustering example: unsupervised star/galaxy (S/G) classification
Input data: DPOSS catalogue (ca. $5 \times 10^6$ objects, 50 features each)
SOM (output is a U-Matrix) ~ GTM (output is a PDF)
1. Input data (Tables or strings)
2. Feature selection (backward elimination strategy)
3. Compression of input space and re-design of network
4. Classification
5. Labeling (e.g. 500 well classified objects)
6. …freeze & run on real data
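Step 2 of the list above names a backward elimination strategy. The slides do not say which criterion was used to rank feature subsets, so the sketch below takes the scoring function as a parameter; the feature names and the toy score are hypothetical and serve only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple


def backward_elimination(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Greedy backward elimination.

    Starts from all features (names assumed unique) and repeatedly drops
    the one whose removal gives the best score, stopping as soon as every
    possible removal would lower the score.
    """
    selected = list(features)
    best = score(selected)
    while len(selected) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        trials = [
            (score([f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        trial_score, drop = max(trials, key=lambda t: t[0])
        if trial_score < best:
            break  # any further removal hurts
        selected.remove(drop)
        best = trial_score
    return selected, best


# Toy score (ASSUMPTION): two informative features, small penalty per feature.
INFORMATIVE = {"mag", "area"}


def toy_score(subset: List[str]) -> float:
    return len(INFORMATIVE & set(subset)) - 0.01 * len(subset)


kept, final_score = backward_elimination(
    ["mag", "area", "noise1", "noise2"], toy_score
)
print(kept, round(final_score, 2))
```

In a real run the score would come from the quality of the clustering or classification obtained with each feature subset, which is where most of the computing time goes.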
Star/Galaxy classification
Automatic selection of significant features
Unsupervised SOM (DPOSS data)
Labeling
Localization of a set of 500 faint stars
[Figure: G.T.M. unsupervised clustering; S/G. Curves: stars p.d.f., galaxies p.d.f., cumulative p.d.f.]
[Figure: G.T.M. unsupervised clustering; S/G – CDF Field, $5 \times 10^5$ objects. Curves: stars p.d.f., galaxies p.d.f., cumulative p.d.f.]
Photometric redshifts: a mixed case
[Flowchart labels: SDSS-EDR DB; SOM unsup.; completeness; reliability map; SOM unsup.; set construction; MLP supervised experiments; SOM supervised; feature selection; best MLP model]
• Input data set:
SDSS-EDR photometric data (galaxies)
• Training/validation/test set:
SDSS-EDR spectroscopic subsample
Step 3: experiments to find the optimal architecture
Varying the number of inputs, hidden units, patterns in the training set, training
epochs, Bayesian cycles and inner loops, etc.
Convergence computed on validation set
Error derived from test set
Robust error: 0.02176
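The protocol above (choose among candidate models on a validation set, then quote the error on an untouched test set) can be sketched in a few lines. To keep the example short, a k-nearest-neighbour regressor is swapped in for the MLP used in the talk, the data are synthetic, and all names and split fractions are assumptions.

```python
import random


def three_way_split(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_predict(train, x, k):
    """Mean target value of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(z for _, z in nearest) / len(nearest)


def mean_abs_error(train, data, k):
    """Mean absolute error of the k-NN predictor on `data`."""
    return sum(abs(knn_predict(train, x, k) - z) for x, z in data) / len(data)


# Synthetic (colour, redshift) pairs; NOT real SDSS data.
rng = random.Random(1)
catalogue = []
for _ in range(100):
    colour = rng.uniform(0.0, 1.0)
    catalogue.append((colour, 0.5 * colour + rng.gauss(0.0, 0.02)))

train, val, test = three_way_split(catalogue)

# Model selection uses the validation set only ...
candidate_ks = (1, 3, 5, 9)
best_k = min(candidate_ks, key=lambda k: mean_abs_error(train, val, k))
# ... and the quoted error comes from the test set, which played no role so far.
test_error = mean_abs_error(train, test, best_k)
print(f"selected k = {best_k}, test error = {test_error:.4f}")
```

Keeping the test set out of every tuning decision is what makes the final error an honest estimate; here k plays the role that the number of hidden units or training epochs plays for the MLP.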
iAstro strategy:
• Advance the state of the art through our
workshops and visits.
Many of these exchanges led to results presented in a
Special Issue of Neural Networks (Eds. Tagliaferri
and Longo, vol. 16, nos. 3-4, 2003). Status: ongoing.
• Define our role vis-à-vis large Framework
Programme projects on the virtual
observatory, grid, computer vision, etc.
through an iAstro White Paper.
Status: done in early 2003.
• Spin-off specific targeted actions where
greater resources are needed.
Status: GridFocus Marie Curie RTN network
proposal written and submitted in early 2003;
local initiatives
• Next step: spin-out and commercial
exploitation of our work through a STREP
or IP proposal?
Where & how to know more about iAstro:
iAstro web pages:
http://www.iastro.org
To join the iAstro Mailing List:
send a message to:
iAstro-request@qub.ac.uk
Thanks to: E.U. & to… Prof. Fedi