
Potential applications in CLRC/RAL collaborations
Julian Gallop
October 2002
24/25 October 2002
SDMIV workshop – Julian Gallop
1
commercial / scientific
• Data mining well known in commercial applications
– should the own-brand cornflakes be located next to
the beer?
• Less well known in scientific applications
• Among scientists, it’s common to find
– “not sure that what I need is data mining, but
instead ….”
• Perhaps data mining is regarded too narrowly
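The cornflakes-and-beer question above is the classic association-rule setting. As a minimal sketch, assuming a handful of made-up shopping baskets (the data and the helper names `support` and `confidence` are illustrative, not from the talk), the two standard measures of a rule "cornflakes → beer" can be computed like this:

```python
# Hypothetical shopping baskets; the items are illustrative only.
baskets = [
    {"cornflakes", "beer", "milk"},
    {"cornflakes", "beer"},
    {"cornflakes", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from basket counts."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"cornflakes", "beer"}, baskets)
rule_confidence = confidence({"cornflakes"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually only reported when both measures exceed chosen thresholds; the thresholds themselves are a judgment call.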
Definitions
• an early (1991) definition of Knowledge Discovery in
Databases (KDD) was given as:
– "the non-trivial extraction of implicit, previously unknown,
and potentially useful information from data" (Frawley et al.
1991).
• this was subsequently (1996) revised to:
– "the non-trivial process of identifying valid, potentially
useful and ultimately understandable patterns in data"
(Fayyad et al 1996).
• data mining is one step in the KDD process concerned with applying computational techniques
to find patterns in data
CLRC scientific fields and collaborations
• Sciences: space, earth observation, particle physics,
microstructures, synchrotron radiation . . .
• Holds (or provides access to) significant data collections
• Partnerships between E-science centre, BITD, computational
science and science departments
• E-science projects include:
– Ones that are mainly CLRC (e.g. Data Portal)
– UK e-science collaborations (e.g. Astrogrid, NERC Data
Grid, gViz)
– EU collaborations (e.g. DataGrid)
– And also the UK Grid Support Centre
Sample CLRC e-science project – Data Portal
• Data Portal project – pilot project within CLRC:
– To enable a scientist to discover, explore and retrieve disparate datasets
through one interface, independent of the data location.
– CLRC sciences: space science, synchrotron science and neutron science, as well as e-science and IT.
– Part of the work is the development of a scientific metadata model
Sample e-science projects involving CLRC
• Astrogrid (UK)
– Building a virtual observatory
– Ideas on data mining:
• Finding: association rules; deviations from a rule; similarity;
clustering and classification
• Datagrid (EU): aims to enable next generation scientific exploration
which requires intensive computation and analysis of shared large-scale
databases, millions of Gigabytes, across widely distributed
scientific communities.
– Applications are: biomedical, earth observation, particle
physics
• NERC Data Grid (UK)
NERC Data Grid
• Funded by NERC & UK e-science core programme
• Involves:
– CLRC (RAL & DL – including British Atmospheric Data
Centre)
– Program for Climate Model Diagnosis and Intercomparison
(PCMDI) (Lawrence Livermore National Laboratory, U.S.)
• Relevant to:
– energy; water management; food chain; health;
weather risk
NERC Data Grid – relevance to knowledge
discovery
• Aims to address problem that
– at present searching metadata to discover and retrieve
what you want is a manual process
– Datasets in multiple locations involve multiple logins and
retrieval in multiple formats
• indicators of success:
– that it will be possible to find, reformat and visualize
disparate datasets from disparate organisations within one
organisation
– Ability to test data and comparison ideas without learning
foreign formats and establishing personal relationships
every time
• Clearly will provide a basis for knowledge discovery if
successful
Earth observation instruments
• For example ENVISAT
• Instrument AATSR
• Low orbit, about 14 orbits/day
• Returns to same place every
3 days
• Picture shows plume from
Mt Etna in 2001 (previous
instrument ATSR2)
• NASA AQUA: terabytes of data per day
Earth observation patterns
• For particular location, what patterns emerge on:
– A daily basis
– Or a yearly basis
• Knowing the conventional pattern day by day, can
observe out of the ordinary events e.g. an oil slick
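One simple way to act on "knowing the conventional pattern" is to compare a new observation with the historical spread for the same location and flag large departures. The sketch below is an illustration only: the function name `flag_unusual`, the threshold of 3 standard deviations and the sample values are all assumptions, not part of any CLRC system.

```python
from statistics import mean, stdev


def flag_unusual(history, today, threshold=3.0):
    """Return True if `today` lies more than `threshold` sample standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historical variation: any different value is unusual.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical measurements for one location, one value per earlier day.
history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]

print(flag_unusual(history, 10.1))  # a value within the usual spread
print(flag_unusual(history, 12.5))  # a value far outside it
```

Real earth observation data would also need cloud masking and seasonal baselines before a test of this kind is meaningful.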
climateprediction.net
• Makes use of spare compute capacity
on office and home PCs to run a
climate prediction model
• Different PCs run different
parameters and collectively run a
Monte Carlo simulation
• Results will be studied to find out
which subsets of the parameter space
correspond to observation
• Better understanding of uncertainties
• Public understanding of climate change
• Oxford U, CLRC RAL, Reading U, with
Met Office and OU
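The ensemble idea above can be sketched in a few lines: sample parameter sets at random, run a model for each, and keep the runs whose output agrees with an observation. Everything here is a stand-in chosen for illustration: `toy_model` is a made-up linear function, not the climate model, and the parameter names, observed value and tolerance are assumptions.

```python
import random


def toy_model(cloud_factor, droplet_rate):
    """Stand-in for one model run: returns a single made-up response value."""
    return 2.0 + 3.0 * cloud_factor - 1.5 * droplet_rate


def run_ensemble(n_runs, seed=0):
    """Sample parameter sets uniformly at random and run the toy model on each."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        params = {
            "cloud_factor": rng.uniform(0.0, 1.0),
            "droplet_rate": rng.uniform(0.0, 1.0),
        }
        runs.append((params, toy_model(**params)))
    return runs


def consistent_with(runs, observed, tolerance):
    """Keep only the runs whose response lies within `tolerance` of `observed`."""
    return [(p, r) for p, r in runs if abs(r - observed) <= tolerance]


runs = run_ensemble(1000)
kept = consistent_with(runs, observed=2.5, tolerance=0.2)
print(len(kept), "of", len(runs), "runs are consistent with the observation")
```

The retained parameter sets outline the region of parameter space that matches observation, and their spread gives a rough view of the remaining uncertainty.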
Data in climateprediction.net
• base
– Latitude: 96
– Longitude: 72
– Levels: 19
– Timesteps calculated every 30 mins / 1 hr and
output for every day over a period of 50 years
• variables
– Horizontal velocity
– Temperature
– Surface pressure
– Water vapour (atmosphere)
– Salinity (ocean)
• Possible others, such as ocean carbon content and
atmospheric ozone and sulphates
17000 registered in advance of launch
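A rough size estimate shows why only a subset of each run can be returned. The grid sizes and the 50-year daily output come from the slide; the 365-day year, the 4-byte value size and the restriction to a single three-dimensional variable are assumptions made for this back-of-envelope sketch.

```python
# Grid sizes as quoted on the slide.
N_LAT, N_LON, N_LEVELS = 96, 72, 19
YEARS = 50

# Assumptions for illustration only.
DAYS_PER_YEAR = 365
BYTES_PER_VALUE = 4

cells = N_LAT * N_LON * N_LEVELS            # grid cells per output step
daily_outputs = YEARS * DAYS_PER_YEAR       # one output per simulated day
values_per_field = cells * daily_outputs    # values for one 3-D variable
gigabytes = values_per_field * BYTES_PER_VALUE / 1e9

print(f"{cells} cells x {daily_outputs} days = {values_per_field} values")
print(f"about {gigabytes:.1f} GB per 3-D variable per run")
```

Multiplying by several variables and many thousands of participating PCs gives a total far beyond what a central server could reasonably collect.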
parameters in climateprediction.net
• Physics parameters that may be varied between one
run and another:
– Representation of cloud variability
– Rate at which water droplets collide and cohere
– Number of nucleation particles for cloud droplet formation
– Light scattering in the atmosphere
– Cloud convection
– Surface processes such as rate of transpiration by
plants
• Also, runs will be duplicated to detect tampering
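Duplicate runs make tampering detectable because two honest runs of the same parameter set should agree. The sketch below compares digests of returned results; it assumes that honest runs are bit-for-bit reproducible across PCs, which may not hold in practice, and the function `find_disputed` and the tuple layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def digest(result_bytes):
    """SHA-256 digest of one returned result."""
    return hashlib.sha256(result_bytes).hexdigest()


def find_disputed(submissions):
    """`submissions` is an iterable of (param_set_id, pc_id, result_bytes).

    Returns the ids of parameter sets whose duplicate runs disagree.
    """
    digests = defaultdict(set)
    for param_set_id, _pc_id, result_bytes in submissions:
        digests[param_set_id].add(digest(result_bytes))
    return sorted(pid for pid, seen in digests.items() if len(seen) > 1)


# Hypothetical submissions: p2 was run twice and the results differ.
submissions = [
    ("p1", "pc-a", b"result-1"),
    ("p1", "pc-b", b"result-1"),
    ("p2", "pc-c", b"result-2"),
    ("p2", "pc-d", b"result-2-altered"),
]
print(find_disputed(submissions))
```

A disagreement only shows that one of the copies is suspect; a third run would be needed to decide which.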
Data distribution in climateprediction.net
• Results dataset will be distributed at several
(possibly 20) climate modelling institutions
• A subset of data is returned from a PC to a data
server. Remainder is therefore kept on the (home
or office) PC and available – if the owner so
chooses.
• Program attempting to data mine needs to be
isolated from these details, by appropriate portal,
metadata and/or catalogue
Climateprediction.net questions
• Some questions that need to be askable
– What features of the response are robust as we
change the physics?
– What kind of changes have similar effects to each
other?
– Which models that are consistent with current
observations give changes in extreme events in the
future?
• Unclear whether this is data mining in strict sense,
but certainly multivariate statistical techniques
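As one illustration of the multivariate statistics involved, the question "what kind of changes have similar effects to each other?" can be approached by correlating the response vectors of different perturbations. The response values, the perturbation names and the 0.9 cut-off below are invented for the sketch; a real analysis would work on full model output fields.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def similar_pairs(responses, min_r=0.9):
    """Pairs of perturbations whose responses correlate at least `min_r`."""
    return [
        (a, b)
        for a, b in combinations(responses, 2)
        if pearson(responses[a], responses[b]) >= min_r
    ]


# Hypothetical response changes at four locations for three perturbations.
responses = {
    "cloud_variability": [0.5, 1.0, 1.5, 2.0],
    "droplet_collision": [1.1, 2.0, 2.9, 4.2],
    "plant_transpiration": [2.0, 1.0, 1.8, 0.4],
}
print(similar_pairs(responses))
```

Correlation captures only linear similarity; clustering the response vectors would be a natural next step.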
Summing up
• NERC Data Grid project, for example, exposes current
difficulties of doing data mining on large scientific datasets
– In commercial situation, data is warehoused under single
operational control
– In science, access is needed to different datasets which are
under different managements
– Multiple logins, multiple metadata systems
• Current e-science projects are providing a mechanism, which
future data mining could use
• Applications include: earth observation; particle physics;
astronomy; biology; . . . . .