Knowledge Extraction from Scientific Data
  Roy Williams
  California Institute of Technology
  roy@caltech.edu
  SDMIV, 24 October 2002, Edinburgh
  KE Tools
  S Data

Scientific Data
  Datacubes
    N-dimensional array
    – spectrum, time-series,
    – image, voxels, hyperspectral image
    Concentration
    Pattern matching
    Integration
  Event Sets
    Often derived from pattern matching
    A set of events is a table
    Integrating Event Sets
    Clustering

Knowledge Extraction
  Concentration
    principal components
    cluster/outlier finding
  Datacube
  Eventset
  Pattern matching
    From theory or from training set
  Integration
    registration of datacubes
    join / crossmatch of eventsets

Datacube
  Some stars from the DPOSS survey

Datacube
  An AVIRIS image of San Francisco Bay
  atmospheric absorption
  400-2500 nm in 224 bands
  R. Green, JPL

Concentrating Information
  e.g. Principal Component Analysis
    Given a set of vectors
    Compute dot products (same as correlations)
    Diagonalize
    Throw out weaker (noise) components

Information concentration
  Principal Component Analysis

Event Sets
  Created by pattern matching
    from a known rule
    from a training set
    by finding clusters

Event Set = Table
  103?
  Column: name=ID content=key units=none datatype=char
  108?
    E3948547
    E3948545
    E3943766
  Column: name=longitude content=Earth coordinate units=degrees datatype=double display=f6.2
    43.4
    87.2
    83.2

Gravitational Lenses
  Pattern matching finds events in datacubes
  A. Szalay, Johns Hopkins

Black hole collisions
  LIGO: Laser Interferometer Gravitational-Wave Observatory

Creating Event Sets
  Supervised Classification
  Given a set of volcanoes, find a lot more volcanoes
  Here we use Singular Value Decomposition

Multiparameter data
  [Figure: colour-colour-fX/fopt plots]
  Labels: all sources, high fX/fopt, medium fX/fopt, low fX/fopt
  Classes: stellar, galaxy, compact galaxy
  symbols: X-ray source counterparts
  contours: all optical objects
  Annotations: BLAGN, active dM stars, F/G stars?, NELGs, possible hi-z quasar, normal galaxies?
  Mike Watson, Leicester University
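
The Principal Component Analysis steps listed under Concentrating Information (compute dot products, diagonalize, throw out the weaker components) can be illustrated with a minimal sketch. It is not the method used in the talk. The function names `covariance`, `top_components`, and `project` are hypothetical, the toy data are invented, "dot products" is read here as the mean-centered covariance matrix (an assumption), and power iteration with deflation stands in for a full diagonalization.

```python
import math
import random

def covariance(vectors):
    """Mean-center the vectors and return (mean, covariance matrix).
    Each matrix entry is a dot product between two centered coordinates."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov

def top_components(cov, k, iters=500):
    """Return the k leading (eigenvalue, eigenvector) pairs of a symmetric
    matrix, using power iteration with deflation (an assumed stand-in for
    a full diagonalization)."""
    d = len(cov)
    m = [row[:] for row in cov]          # work on a copy
    rng = random.Random(0)               # fixed seed for repeatability
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(v[i] * mv[i] for i in range(d))
        pairs.append((lam, v))
        # Deflate: remove the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                m[i][j] -= lam * v[i] * v[j]
    return pairs

def project(vectors, mean, components):
    """Coordinates of each vector along the retained components only."""
    return [[sum((v[j] - mean[j]) * c[j] for j in range(len(c)))
             for c in components]
            for v in vectors]

# Toy data (invented): points close to the line y = 2x.
data = [[float(t), 2.0 * t + 0.01 * (-1) ** t] for t in range(10)]
mean, cov = covariance(data)
pairs = top_components(cov, k=2)
strong = [vec for lam, vec in pairs[:1]]   # throw out the weaker component
reduced = project(data, mean, strong)      # 2-D points concentrated to 1-D
print(pairs[0][0], pairs[1][0])
```

In this toy case nearly all of the variance lies along one direction, so each two-dimensional point is summarized by a single coordinate with little loss.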
Integrating Datacubes
  Find a mapping from one domain to the other
  Registration of DPOSS and Hubble Deep Field

Datacube Registration
  Movement of ice inferred from registration

Integrating Event Sets
  Database Join
  Fuzzy Join
    e.g. astronomical crossmatch
  Distributed Join
    does the Grid do databases?

Integration of Star Catalogs

Visualizing Event Sets
  Unsupervised clustering
  50000 stars in color-color space

A Grid of Services
  Human gets Data
  Understood by human
  Further processing after format change
  Network of Services
  Grid of pipes and engines
  Switches and actuators
  data flow

Example Grid of Services
  Catalog Service
  Query Check Service
  Query Estimator
  DPOSS Service
  Crossmatch Service
  User’s code
  2MASS Service
  Storage Service
  flexible complex metadata AND broadband binary

Computing Challenges
  • High-dimensional
      Clustering & Classification
      Visualization
      Outlier Detection
  • Visualization of 10^10 points
  • Database access to 10^10 points
  • Large Distributed Join

Standards needed
  • Bundling diverse objects together with code and references
  • Referencing data resources on the Grid
      local, remote, replicated, ....

Problem Solving Environment
  Catalog Service
  Query Check Service
  Query Estimator
  DPOSS Service
  Crossmatch Service
  User’s code
  Storage Service
  2MASS Service
  • Plumbing (big data) and electrical (control, metadata)
  • Web service and workflow
  • Finding service classes/implementations by semantics
  • GUI / Executive / IO adapters / Algorithms
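
The fuzzy join named under Integrating Event Sets (the astronomical crossmatch) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `angular_separation_deg` and `crossmatch`, the catalog rows, and the 2 arcsecond tolerance are all hypothetical, and the brute-force double loop would not scale to the catalog sizes discussed in the talk, where a spatial index or a distributed join would be needed.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    given in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def crossmatch(cat_a, cat_b, tol_arcsec):
    """Fuzzy join: pair each row of cat_a with its nearest row of cat_b,
    provided the separation is within tol_arcsec.
    Rows are (id, ra_deg, dec_deg) tuples (an assumed layout)."""
    tol_deg = tol_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        best = None
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= tol_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            # Report the separation in arcseconds.
            matches.append((id_a, best[0], best[1] * 3600.0))
    return matches

# Invented rows for illustration; not taken from any real catalog.
cat_a = [("A1", 10.0, 20.0), ("A2", 150.0, -5.0), ("A3", 200.0, 45.0)]
cat_b = [("B1", 10.0002, 20.0001), ("B2", 150.5, -5.0), ("B3", 200.0, 45.0003)]

for id_a, id_b, sep in crossmatch(cat_a, cat_b, tol_arcsec=2.0):
    print(id_a, id_b, round(sep, 2))
```

Unlike an exact database join on a shared key, the match condition here is a distance threshold, so the result depends on the chosen tolerance and on how ties are broken (nearest neighbor in this sketch).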