Data Mining with AURA
Jim Austin
University of York
&
Cybula Ltd
Overview
• AURA
• Background to AURA
• Brief overview of its components
• Its implementation
• AURA within UK e-Science
• What is e-Science
• The DAME pilot project
• Use of AURA in DAME
• GRID issues in DM
22 Oct 2001
The AURA Technology
• Neural network based associative storage
• Set of tools to build fast pattern recognition
systems
• Aimed at unstructured data
• Aimed at large datasets
• Scalable technology
AURA as a basis for search
• The game is to remove the chaff using
AURA.
• Later processes find the exact match.
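The filter-then-verify idea on this slide can be sketched as below. This is only an illustration under stated assumptions: a character-set test stands in for the AURA filter, and the function names and sample records are hypothetical.

```python
def coarse_filter(query, records):
    """Cheap test: keep records that contain every character of the query.

    This stands in for the fast filter stage; it may let false positives
    through but never rejects a true match.
    """
    wanted = set(query)
    return [r for r in records if wanted <= set(r)]


def exact_match(query, candidates):
    """Slower exact test, run only on the records that survived the filter."""
    return [r for r in candidates if query in r]


# Hypothetical records, for illustration only.
records = ["turbine blade", "bearing wear", "bald eagle", "fuel pump"]
candidates = coarse_filter("blade", records)   # chaff removed
final = exact_match("blade", candidates)       # exact match found
```

Here "bald eagle" passes the coarse stage but is rejected by the exact stage, which is the division of labour the slide describes.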
The storage system
• Correlation Matrix Memory based
• Exploits threshold logic methods
• Uses distributed encoding of information
• Implemented using binary ‘weights’ for efficient software and hardware implementation
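A minimal sketch of a binary correlation matrix memory follows, assuming the common textbook formulation: training ORs the outer product of an input and output pattern into the weights, and recall sums the rows selected by the active inputs and applies a threshold. It is not the AURA library's API; the function names and toy patterns are hypothetical.

```python
def train(matrix, inp, out):
    """Store one association by OR-ing the outer product into the binary weights."""
    for i, a in enumerate(inp):
        if a:
            for j, b in enumerate(out):
                if b:
                    matrix[i][j] = 1


def recall(matrix, inp, threshold):
    """Sum the weight rows selected by active inputs, then threshold each column."""
    n_cols = len(matrix[0])
    sums = [0] * n_cols
    for i, a in enumerate(inp):
        if a:
            for j in range(n_cols):
                sums[j] += matrix[i][j]
    return [1 if s >= threshold else 0 for s in sums]


# Hypothetical toy patterns: 6 input lines, 4 output lines, 2 bits set per input.
weights = [[0] * 4 for _ in range(6)]
train(weights, [1, 1, 0, 0, 0, 0], [1, 0, 0, 0])
train(weights, [0, 0, 1, 1, 0, 0], [0, 1, 0, 0])
result = recall(weights, [1, 1, 0, 0, 0, 0], threshold=2)
```

With the threshold set to the number of active input bits, only a stored pattern that matches on every active line is returned.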
[Figure: correlation matrix memory. Labels: Inputs, weights, P, M, Threshold T, R]
Why is it fast?
• Access only rows that are activated by
inputs.
• Inputs are made as sparse as possible and
fixed weight.
• Only need to sum over active rows (bit
vectors) – ideal for most processors
• Great for bit vector machines (DAP!).
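The row-access argument can be illustrated with a small sketch, assuming each weight row is held as a bit vector (here a Python integer) and the input is given as a list of active row indices. The names and values are hypothetical, and a real implementation would use machine words rather than Python integers.

```python
def recall_sparse(rows, active_inputs, n_cols, threshold):
    """Recall that touches only the rows named in active_inputs.

    rows[i] is a bit vector (int) whose bit j is the weight at row i, column j.
    Returns the indices of columns whose sum reaches the threshold.
    """
    sums = [0] * n_cols
    for i in active_inputs:            # inactive rows are never read
        row = rows[i]
        while row:
            low = row & -row           # isolate the lowest set bit
            sums[low.bit_length() - 1] += 1
            row ^= low                 # clear that bit and continue
    return [j for j, s in enumerate(sums) if s >= threshold]


# Hypothetical 6-row, 4-column memory; two input bits are active.
rows = [0b0001, 0b0001, 0b0010, 0b0010, 0b0000, 0b0000]
matches = recall_sparse(rows, active_inputs=[0, 1], n_cols=4, threshold=2)
```

Because inputs have a fixed number of set bits, the work per query depends on that number and not on the total number of rows.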
Use of the CMM
[Figure: flow diagram. Labels: Query, CMM system, Data, Data subset, Slow algorithm, Final data]
CMM system
[Figure: Operations. Labels: Pre-process, Prepare data, CMM system, Post process]
Pre-processing
• Implements a number of pre-processors
– N-grams for text strings
– CMAC for numeric data
– Graphs for images and graphics
– Tokens for logical data
– Quantisation for time series
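As an illustration of the first pre-processor in the list, the sketch below turns a text string into a binary vector via its n-grams. Hashing each n-gram to a bit position with CRC32 is an assumption made for this example, as are the vector width and function names; the slides do not say how AURA maps n-grams to bits.

```python
import zlib


def ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def encode(text, n=3, width=64):
    """Set one bit per n-gram in a fixed-width binary vector (assumed scheme)."""
    bits = [0] * width
    for gram in ngrams(text, n):
        bits[zlib.crc32(gram.encode("utf-8")) % width] = 1
    return bits


def overlap(a, b):
    """Count the bit positions set in both vectors."""
    return sum(x & y for x, y in zip(a, b))


# Strings that share n-grams share set bits, so overlap acts as a similarity score.
score = overlap(encode("engine"), encode("engines"))
```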
Post processing
• Data selected by the CMM must be
accessed quickly.
• Uses ‘best bit index’ method to match
output data and recover stored data.
Implementation
• The AURA C++ library
• Implemented on PC or workstation
• Beowulf parallel cluster
• Origin 2000 supercomputer
• Bespoke hardware
Cortex-1
AURA parallel implementation
28 dedicated PCI based processors
Beowulf configuration
3.5Gb memory size
UK e-Science
• Aims to build on the concept of Grids
– To make computing and data provision as
direct and simple as electrical power delivery
• £110M initiative started 18 months ago
• DAME is a £3.5M pilot project to
demonstrate its application in the
engineering field.
DAME Objectives
• DAME: Distributed Aircraft Maintenance
Environment.
• Demonstrate diagnostic capability on the
GRID
• Examine timeliness properties of the GRID
• Demonstrate on the RR Aeroengine
diagnostic problem
University of Sheffield, P Fleming.
University of Leeds, Peter Dew, Alison McKay.
York, J Austin, J McDermid, A Wellings.
University of Oxford, Lionel Tarassenko.
Rolls-Royce
Rolls-Royce, Derby.
Data Systems & Solutions.
Cybula Ltd.
[Figure: DAME scenario. Labels: Engine flight data, London Airport, Airline office, New York Airport, Grid, Diagnostics centre, Maintenance Centre, American data center, European data center]
Diagnostic issues
• The system must
– Analyse and report novel engine operation
– Identify any cause of events
– Do this quickly
• Data
– Large (many Tb)
Data – Zmod plots
How does AURA contribute?
• Search technology for multi-media data
• Parallel pattern match engine based on
neural networks.
• Built on Correlation Matrix Memories.
• High performance Beowulf and dedicated
hardware implementations.
• Commercially sold by Cybula Ltd.
[Figure: DAME diagnostic architecture. Labels: Engine data; Quote; Diagnostic station; Novelty indication; Data used to identify novelty; Data reduction processes; Match requests; Features; Data to be searched for; Data stores/data warehouse; Pattern match results; Diagnosis; AURA-G; GRID]
Simple example of processing chain
[Figure: processing chain. Labels: Data sample, DM coding, CMM, Matching previous events]
Typical pre-processing
[Figure: plot of Frequency against Time, with the signal coded as 01101111011110111]
• DM coding (1 up and 0 down)
• Fast
• Preserves information
• Produces a binary vector
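The "1 up and 0 down" coding can be sketched as follows, assuming DM coding compares each sample with the previous one and emits a single bit per step. Treating an unchanged value as "down" is an assumption of this example, and the sample values are hypothetical.

```python
def dm_code(samples):
    """One bit per step: 1 if the signal went up, 0 otherwise.

    Assumption: an unchanged value is coded as 0, since the slide only
    distinguishes "up" from "down".
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(samples, samples[1:])]


# Hypothetical frequency readings over time.
signal = [3, 5, 4, 4, 7, 9, 8]
bits = dm_code(signal)
```

The result is a binary vector one element shorter than the input, which can be presented directly to a binary memory.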
AURA-G
• This is a Globus enabled AURA
implementation.
• Developed under DAME
• Will be available end of 2002 for use in
other problems.
AURA-G
• Supports scalable pattern matching
• Supports distributed search, across multiple
CMM engines at different sites
• OGSA compliant
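The distributed search pattern can be sketched as a fan-out and merge, under loud assumptions: local threads stand in for remote CMM engines, bit overlap stands in for the CMM match score, and the shard contents and site names are hypothetical. It does not use Globus or OGSA interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def search_engine(shard, query_bits, threshold):
    """Stand-in for one remote engine: score local records by bit overlap."""
    hits = []
    for record_id, bits in shard.items():
        score = bin(bits & query_bits).count("1")
        if score >= threshold:
            hits.append((score, record_id))
    return hits


def distributed_search(shards, query_bits, threshold):
    """Send the same query to every shard in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = pool.map(lambda s: search_engine(s, query_bits, threshold), shards)
        merged = [hit for part in parts for hit in part]
    return sorted(merged, reverse=True)   # best score first


# Hypothetical shards held at two sites.
shards = [
    {"york-1": 0b101100, "york-2": 0b000011},
    {"leeds-1": 0b101000, "leeds-2": 0b111100},
]
results = distributed_search(shards, query_bits=0b101100, threshold=2)
```

Each engine applies the threshold locally, so only candidate matches travel back to the caller for merging.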
Grid Issues in Data Mining
• Data provenance
• Standards:
– Data transparency independent of location
– Managing DB/Data mining link in distributed
system
– OGSA DAI
Conclusions
• AURA is a mature component for data
search and retrieval
• Robust software and hardware
implementation available
• e-Science Grid applications are underway
Contacts
Jim Austin
Dept Computer Science, University
of York, York, YO10 5DD.
www.cs.york.ac.uk/arch
austin@cs.york.ac.uk
01904 432734
01904 432767
Cybula Ltd.
www.cybula.com
01377 236382
DAME : www.cs.york.ac.uk/dame