COST Action n. 283 - progress report, June 2003
Computational and Information Infrastructures in the Astronomical Data GRID
Chair: Prof. F. Murtagh, Queen's University Belfast

Giuseppe Longo
Chair of Astrophysics, Department of Physical Sciences
University of Napoli “Federico II”, Italy & INFN (Italian Institute for Nuclear Physics)
longo@na.infn.it

Hrvatska, June 3rd, 2003

Methodological background
Id est: is history teaching us something (or isn’t it?)…

Role of Technological Breakthroughs
[Figure: all discoveries, before 1954 and after 1954]

Where is (now) the next breakthrough in Astronomy?
Either new channels (better: new information carriers):
• Electromagnetic waves (optical since 1609, other bands since the 60’s)
• Solid samples (70’s ->)
• Gravitational waves (2005 ->)
• Neutrinos (early 80’s ->)
Or leaps in any of:
• Sensitivity
• Spectral range
• Spectral resolution
• Angular resolution
• Time resolution

The iAstro people believe that:
• Massive data mining
• Massive data sets
• Distributed computing
[Figure: CCDs versus glass, 1970 to 2000, logarithmic axis from 0.1 to 1000]
Hardware breakthrough: wide-field imaging with CCD mosaics enables digital surveys
Discoveries
The sky covers 40,000 sq. deg. With 0.6 arcsec sampling:
• 2 x 10^12 pixels
• 8 TB per band (10/100 TB/survey)
• Ca. 10 PB keeping temporal resolution (Δt ca. 24 h for 1 yr … need for 20 yr)

From Traditional to Survey Science
[Diagram elements. Traditional: Telescope; Data Analysis; Results. Survey-Based: Survey Telescope; Archive; Another Survey/Archive?; Data Mining; Target Selection; Follow-Up Telescope; Results]
Survey science is highly successful and increasingly prominent, but inherently limited by the information content of individual surveys …
What comes next, beyond survey science, is distributed (V.O.) science
Courtesy of G.
Djorgovski

A Schematic Illustration of the new astronomy
[Diagram elements: Primary Data Providers (Surveys, Observatories, Missions); Survey and Mission Archives; VO Data Services (Data Mining and Analysis, Target Selection); Secondary Data Providers (Follow-Up Telescopes and Missions); Results; Digital libraries]
Courtesy of G. Djorgovski

Panchromatic view of the Universe
[Image panels: Radio; Visible + X-ray; Far-Infrared; Visible; Dust Map; Density Map]
New domains of the parameter space: cf. time
Search for the unknown
Offers:
• Different physics
• Global understanding
• Comparison with theory
• New discoveries
[Figure: Faint, Fast Transients (Tyson et al.)]

Pixel space (raw data; TB/PB)
• Huge data flow
• Data fusion
• Need for recalibrations
Calls for…
• Automatic catalogue extraction
• Spurious features removal
• Image parametrization and classification
• Data compression
• Multiscale analysis, etc.

Catalogue space (features; TB)
[Diagram axes: Flux; Non-EM …; Morphology / Surf.Br.; Time; Wavelength; Proper motion; RA; Dec; Polarization]
High dimensionality (N >> 100)
What is the coverage? Where are the gaps?
Calls for…
• Feature selection
• Clustering
• Statistics
• KDD
• Visualization, etc…

Sounds beautiful! … BUT:
[Figure: hours of computer time per night (logarithmic axis, 0.0000001 to 1000) versus year (1500 to 2000); T2 (Moore) ~ 1.5 years]
Terascale (Petascale?) computing and/or better algorithms are required

Digital sky surveys call for huge increases in computing power
In modern data sets: D_D >> 10, D_S >> 3
Data Complexity -> Multidimensionality -> Discoveries
But the bad news is …
The computational cost of clustering analysis:
• K-means: K x N x I x D
• Expectation Maximisation: K x N x I x D^2
• Monte Carlo Cross-Validation: M x Kmax^2 x N x I x D^2
N = no. of data vectors, D = no. of data dimensions
K = no. of clusters chosen, Kmax = max no. of clusters tried
I = no. of iterations, M = no.
of Monte Carlo trials/partitions
Some dimensionality reduction methods do exist (e.g., PCA, class prototypes, hierarchical methods, etc.), but more work is needed

“Standard Activities”
All meeting reports and proceedings are on the web
• First and Second MC meetings, Brussels, 11/23/2001 & 2/14-15/2002
• Third MC meeting, Edinburgh, 07/21/2002 (at GGF-5, Global Grid Forum 5)
• Fourth MC meeting & workshop on Multispectral data analysis and image metadata, Strasbourg, 11/28-29/2002
• Fifth MC meeting & workshop on High/low resolution signal processing, Granada, 02/22-23/2003
• Planned: Sixth MC meeting & workshop on Poisson noise models, Nice, Oct. 2003
• Planned: Seventh MC meeting & workshop on Data mining & Image analysis in a distributed environment, Capri, Mar. 2004

[Photo: Guess who was taking the picture… Granada, February 2002]

Major Orientation of iAstro in early 2003: FP6
• Expressions of Interest filed in summer 2002.
• Participation in Commission Information Days.
• Involvement in several Networks of Excellence (NoEs): sensor fusion, information retrieval, e-education and training, the European virtual observatory, and digital signal processing and data mining in medicine.
• Participation in evaluation panels.

COST 283 proposal for the Marie Curie RTN network
“GridFocus: Data and Information Fusion and Mining in the Context of the DataGrid”
• Submitted early April 2003.
• Participants: iAstro partners in BG, CH, D, E, F, GR, H, I, IRL and UK. Additional partner cluster at the University of Paris Sud.
Multiband and multiple layer image and signal processing as a basic paradigm for the data Grid.
Data mining of visual and other streams, including high performance forensic image data mining.
Empirical and virtual data interfaces.
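
As a rough illustration of the clustering cost scalings quoted earlier (K-means ~ K N I D, Expectation Maximisation ~ K N I D^2, Monte Carlo Cross-Validation ~ M Kmax^2 N I D^2), the sketch below evaluates the three leading-order operation counts side by side. The function name and every numerical value (N, D, K, Kmax, I, M) are assumptions chosen only for illustration; they are not taken from the slides.

```python
def clustering_costs(n, d, k, k_max, iters, trials):
    """Leading-order operation counts for the three quoted scalings."""
    return {
        "k_means": k * n * iters * d,
        "em": k * n * iters * d**2,
        "mccv": trials * k_max**2 * n * iters * d**2,
    }

# Illustrative (assumed) values: a few million data vectors with 50 dimensions.
costs = clustering_costs(n=5_000_000, d=50, k=10, k_max=20, iters=100, trials=20)
for name, ops in costs.items():
    print(f"{name}: {ops:.2e} operations")
```

With these assumed values the EM count exceeds the K-means count by a factor D, and the cross-validation count exceeds the EM count by a factor M Kmax^2 / K, which is why the number of dimensions dominates the cost.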

GridFocus concept based on data dynamics and information thermodynamics

SOMETHING ON SCIENCE…

Code in C++, parallelized on Beowulf
Used (so far) for:
• Cosmology
• Particle physics (ARGO)
• Gravitational waves (VIRGO)
[Flowchart elements: open; import; header (compliant / non compliant); head/proc.; preprocessing; supervised and unsupervised branches; parameter and training options; parameter options; training set preparation (labeled); label preparation (labeled / unlabeled); feature selection via unsupervised clustering; Fuzzy set, SOM, GTM, etc.; MLP, RBF, etc.; interpretation]

A standard clustering example: unsupervised S/G classification
Input data: DPOSS catalogue (ca. 5 x 10^6 objects, 50 features each)
SOM (output is a U-Matrix) ~ GTM (output is a PDF)
1. Input data (tables or strings)
2. Feature selection (backward elimination strategy)
3. Compression of input space and re-design of network
4. Classification
5. Labeling (e.g. 500 well classified objects)
6. … freeze & run on real data

Star/Galaxy classification
Automatic selection of significant features
Unsupervised SOM (DPOSS data)

Labeling
Localization of a set of 500 faint stars

G.T.M. unsupervised clustering; S/G
[Figure: stars p.d.f.; galaxies p.d.f.; cumulative p.d.f.]

G.T.M. unsupervised clustering; S/G, CDF Field
[Figure: stars p.d.f.; galaxies p.d.f.; cumulative p.d.f.; 5 x 10^5 obj.]

Photometric redshifts: a mixed case
[Flowchart elements: SDSS-EDR DB; SOM unsup. completeness; Reliability Map; SOM unsup. set construction; MLP supervised experiments; SOM supervised feature selection; Best MLP model]
• Input data set: SDSS-EDR photometric data (galaxies)
• Training/validation/test set: SDSS-EDR spectroscopic subsample

Step 3 - experiments to find the optimal architecture
Varying n. of input, n.
of hidden, n. of patterns in the training set, n. of training epochs, n. of Bayesian cycles and inner loops, etc.
Convergence computed on validation set
Error derived from test set
Robust error: 0.02176

iAstro strategy:
• Advance the state of the art through our workshops and visits. Many of these exchanges presented results in a Special Issue of Neural Networks (Ed. Tagliaferri and Longo, vol. 16, 3-4, 2003). Status: ongoing.
• Define our role vis-à-vis large Framework Programme projects on the virtual observatory, grid, computer vision, etc. through an iAstro White Paper. Status: done in early 2003.
• Spin off specific targeted actions where greater resources are needed. Status: GridFocus Marie Curie RTN network proposal written and submitted in early 2003; local initiatives.
• Next step: spin-out and commercial exploitation of our work through a STREP or IP proposal?

Where & how to know more about iAstro:
iAstro web pages: http://www.iastro.org
To join the iAstro Mailing List, send a message to: iAstro-request@qub.ac.uk

Thanks to: E.U. & to… Prof. Fedi
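
The evaluation protocol described for the photometric redshift experiments (convergence judged on a validation set, final error quoted from a separate test set) can be sketched as follows. This is only a minimal illustration under stated assumptions: the data are synthetic, a k-nearest-neighbour regressor is swapped in for the MLP, and the candidate values of k play the role of the architectures being varied. None of the names or numbers below come from the slides.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rms_error(train, data, k):
    """Root-mean-square error of the k-NN predictor on a data set."""
    sq = [(knn_predict(train, x, k) - y) ** 2 for x, y in data]
    return math.sqrt(sum(sq) / len(sq))

random.seed(0)
# Synthetic (feature, target) pairs standing in for a labeled subsample.
samples = []
for _ in range(300):
    x = random.random()
    samples.append((x, math.sin(3.0 * x) + random.gauss(0.0, 0.05)))
random.shuffle(samples)
train, valid, test = samples[:180], samples[180:240], samples[240:]

# Model selection uses the validation set only.
candidates = [1, 3, 5, 9, 15]
valid_errors = {k: rms_error(train, valid, k) for k in candidates}
best_k = min(valid_errors, key=valid_errors.get)

# The quoted error comes from the untouched test set.
test_error = rms_error(train, test, best_k)
print(f"selected k = {best_k}, test RMS error = {test_error:.4f}")
```

Keeping the test set out of the selection loop is the point of the design: the validation error is biased downward by the selection itself, so only the test error is a fair estimate.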