Potential applications in CLRC/RAL collaborations
Julian Gallop
October 2002
24/25 October 2002 SDMIV workshop – Julian Gallop

Commercial / scientific
• Data mining well known in commercial applications
  – should the own-brand cornflakes be located next to the beer?
• Less well known in scientific applications
• Among scientists, it’s common to find
  – “not sure that what I need is data mining, but instead ….”
• Perhaps data mining is regarded too narrowly

Definitions
• An early (1991) definition of Knowledge Discovery in Databases (KDD) was given as:
  – "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al. 1991).
• This was subsequently (1996) revised to:
  – "the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data" (Fayyad et al. 1996).
• Data mining is one step in the KDD process, concerned with applying computational techniques to find patterns in data

CLRC scientific fields and collaborations
• Sciences: space, earth observation, particle physics, microstructures, synchrotron radiation . . .
• Holds (or provides access to) significant data collections
• Partnerships between E-science centre, BITD, computational science and science departments
• E-science projects include:
  – Ones that are mainly CLRC (e.g. Data Portal)
  – UK e-science collaborations (e.g. Astrogrid, NERC Data Grid, gViz)
  – EU collaborations (e.g. DataGrid)
  – And also the UK Grid Support Centre

Sample CLRC e-science project – Data Portal
• Data Portal project – pilot project within CLRC:
  – To enable a scientist to discover, explore and retrieve disparate datasets through one interface, independent of the data location.
  – CLRC sciences: space science, synchrotron science and neutron science as well as e-science and IT.
  – Part of the work is the development of a scientific metadata model

Sample e-science projects involving CLRC
• Astrogrid (UK)
  – Building a virtual observatory
  – Ideas on data mining:
    • Finding: association rules; deviations from a rule; similarity; clustering and classification
• DataGrid (EU): aims to enable next-generation scientific exploration, which requires intensive computation and analysis of shared large-scale databases, millions of Gigabytes, across widely distributed scientific communities.
  – Applications are: biomedical, earth observation, particle physics
• NERC Data Grid (UK)

NERC Data Grid
• Funded by NERC & UK e-science core programme
• Involves:
  – CLRC (RAL & DL – including British Atmospheric Data Centre)
  – Program for Climate Model Diagnosis and Intercomparison (PCMDI) (U.S. Lawrence Livermore National Lab)
• Relevant to:
  – energy; water management; food chain; health; weather risk

NERC Data Grid – relevance to knowledge discovery
• Aims to address the problem that
  – At present, searching metadata to discover and retrieve what you want is a manual process
  – Datasets in multiple locations involve multiple logins and retrieval in multiple formats
• Indicators of success:
  – That it will be possible to find, reformat and visualize disparate datasets from disparate organisations within one organisation
  – Ability to test data and comparison ideas without learning foreign formats and establishing personal relationships every time
• Clearly will provide a basis for knowledge discovery if successful

Earth observation instruments
• For example ENVISAT
• Instrument AATSR
• Low orbit, 14 orbits/day
• Returns to same place every 3 days
• Picture shows plume from Mt Etna in 2001 (previous instrument ATSR2)
• NASA AQUA: TBs/day

Earth observation patterns
• For a particular location, what patterns emerge on:
  – A daily basis
  – Or a yearly basis
• Knowing the conventional pattern day by day, can observe out-of-the-ordinary events, e.g. an oil slick

climateprediction.net
• Makes use of spare compute capacity on office and home PCs to run a climate prediction model
• Different PCs run different parameters and collectively run a Monte Carlo simulation
• Results will be studied to find out which subsets of the parameter space correspond to observation
• Better understanding of uncertainties
• Public understanding of climate change
• Oxford U, CLRC RAL, Reading U, with Met Office and OU

Data in climateprediction.net
• base
  – Latitude 96
  – Longitude 72
  – Levels 19
  – Timesteps calculated every 30mins / 1hr and output for every day over a period of 50 years
• variables
  – Horizontal velocity
  – Temperature
  – Surface pressure
  – Water vapour (atmosphere)
  – Salinity (ocean)
• Possible others, such as ocean carbon content and atmospheric ozone and sulphates
17000 registered in advance of launch

Parameters in climateprediction.net
• Physics parameters that may be varied between one run and another:
  – Representation of cloud variability
  – Rate at which water droplets collide and cohere
  – # of nucleation particles for cloud droplet formation
  – Light scattering in the atmosphere
  – Cloud convection
  – Surface processes such as rate of transpiration by plants
• Also, runs will be duplicated to detect tampering

Data distribution in climateprediction.net
• Results dataset will be distributed at several (possibly 20) climate modelling institutions
• A subset of data is returned from a PC to a data server. Remainder is therefore kept on the (home or office) PC and available – if the owner so chooses.
• Program attempting to data mine needs to be isolated from these details, by appropriate portal, metadata and/or catalogue

Climateprediction.net questions
• Some questions that need to be askable
  – What features of the response are robust as we change the physics?
  – What kind of changes have similar effects to each other?
  – What models that are consistent with current observations give changes in extreme events in the future?
• Unclear whether this is data mining in strict sense, but certainly multivariate statistical techniques

Summing up
• NERC Data Grid project, for example, exposes current difficulties of doing data mining on large scientific datasets
  – In commercial situation, data is warehoused under single operational control
  – In science, access is needed to different datasets which are under different managements
  – Multiple logins, multiple metadata systems
• Current e-science projects are providing a mechanism, which future data mining could use
• Applications include: earth observation; particle physics; astronomy; biology; . . . . .
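
Illustrative sketch (not part of the original slides)
One of the climateprediction.net questions asks which models that are consistent with current observations give changes in the future. The Python sketch below shows the general shape of such a query over a perturbed-parameter ensemble: sample parameters at random, keep the runs whose hindcast falls within a tolerance of an observed value, and look at the spread of predicted change among the survivors. Everything here is an assumption made for illustration: `toy_model` is a made-up linear stand-in for a climate run, the two parameter names, the observed value and the tolerance are arbitrary, and the data are synthetic. It is not the project's method or code.

```python
import random
import statistics


def toy_model(cloud, convection):
    """Hypothetical stand-in for one climate run (NOT a real model).

    Returns (hindcast, future_change) as simple linear functions of two
    made-up physics parameters in the range [0, 1).
    """
    hindcast = 14.0 + 2.0 * (cloud - 0.5) - 1.0 * (convection - 0.5)
    future_change = 2.0 + 3.0 * cloud + 1.0 * convection
    return hindcast, future_change


def run_ensemble(n_runs, seed=0):
    """Monte Carlo ensemble: each run uses randomly sampled parameters."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    runs = []
    for _ in range(n_runs):
        params = {"cloud": rng.random(), "convection": rng.random()}
        hindcast, change = toy_model(params["cloud"], params["convection"])
        runs.append(
            {"params": params, "hindcast": hindcast, "future_change": change}
        )
    return runs


def consistent_with_observation(runs, observed, tolerance):
    """Keep only runs whose hindcast lies within tolerance of the observation."""
    return [r for r in runs if abs(r["hindcast"] - observed) <= tolerance]


runs = run_ensemble(1000)
# observed=14.0 and tolerance=0.25 are arbitrary illustrative values
kept = consistent_with_observation(runs, observed=14.0, tolerance=0.25)
changes = [r["future_change"] for r in kept]

print("runs kept:", len(kept), "of", len(runs))
print("range of predicted change:", min(changes), "to", max(changes))
print("mean predicted change:", statistics.mean(changes))
```

In a real setting the list of runs would come from the distributed results dataset via a portal or catalogue rather than from an in-memory function, which is the isolation the data distribution slide calls for.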