Informatics for Life: Fighting Natural Disasters with e-Science Discovery Net Project:

Informatics for Life:
Fighting Natural Disasters with e-Science
Moustafa Ghanem
Jian Guo Liu
Imperial College London
John Hassard
Discovery Net Project:
Principal Investigator: Yike Guo
Investigators: John Hassard, Jian Guo Liu, John Darlington, Tony Cass, Dan Ruckert
Team: Moustafa Ghanem, Salman Al Sairafi, Peter Au, Vasa Curcin, Filippia Sofia
Emmanouil, Stuart Hassard, Matt Howard, Steffi Klier, Jinming Ma, Roland Martin,
Anton Olyenikov, Michelle Osmond, Kruti Patel, Fiona Pereira, Mark Richards, Anthony
Rowe, Jameel Syed, Huy Vu, Patrick Wendel, Yong Zhang
Discovery Net
Goal: Constructing the World’s First Infrastructure for Global Knowledge
Discovery Services

Key Features:



Allow Institutions to Integrate, Manage and
Utilise their Intellectual Property

Open Service Computing

High Throughput Devices and Real Time Data
Mining

Real Time Data Integration & Information
Structuring

Literature
Scientific
Discovery
Real Time Data
Integration
Discovery
Services
Dynamic Application
Integration
Integrative Knowledge
Management
Databases
Operational
Data
Using GRID Resources
Images
Cross Domain Knowledge Discovery and
Management
Discovery Workflow and Discovery Planning
In Real Time
Service Workflow
Allow Scientists to Construct, Share and
Execute Complex Knowledge Discovery
Procedures & Services
Key Technologies:

Scientific
Information
Instrument
Data
Workflow Technology
Native MPI
Condor-G
OGSA-service
Web Service
Resource
Mapping
Web Wrapper
Sun Grid
Engine
Oracle 10g
Unicore
Workflow Execution
A Compositional GRID
Workflow
Warehousing
Workflow Authoring
Composing Services
Workflow Management
Collaborative Knowledge Management
Service
Abstraction
Workflow Deployment:
Grid Service and Portal
Discovery Net Applications
• Life Sciences
  • High throughput genomics and proteomics
• Environmental Monitoring
  • High throughput dispersed air sensing technology
• Real time Geo-hazard Modelling
  • Earthquake modelling through satellite imagery
Progress and Achievements
Infrastructure
DNet Architecture
Workflow System
Prototype
InfoGrid
Service
Integration
Service
Deployment
High Performance
Computing Challenge
Workflow
Warehousing
1,800 clicks, 350 cut and pastes, 200 database accesses and 250 accesses to
computation services will be done in one workflow and one click
Real
Publish as
Public
Services
Applications Case Studies
Oracle 10g Fighting SARS
OGSA-DAI
Geohazard Modelling
Environmental Monitoring
Recognized for outstanding text mining ability in an international competition
organized by the ACM (Association for Computing Machinery)
Components/Services
Data Analysis
Text Mining
GIS
Image Mining
International Collaborations
Chemistry / Biochemistry
Application Testbeds
Life Science
Life Science
Environmental Monitoring
Life Science
Environmental Monitoring
Geohazard
Year 1
2002
Year 2
2003
Year 3
2004
Life Science Applications
Real-time Collaborative Genome Annotation
• Sanger Centre Collaboration, SC2002 Top Prize
• Shanghai Bioinformation Institute: SARS Genome Annotation
Large-scale Integrative Functional Genomics
• Three National Micro-arrays Centres Worldwide
• Three International Pharmaceutical Companies
• Imperial College Projects (BAIR £5.5 M, Wellcome Trust)
Real-time High Throughput Chemoinformatics
• Collaboration with 2 International Pharmaceutical Companies
• Novel projects integrating Bioinformatics & Chemoinformatics
Imperial College
Large-scale Genotyping Data Analysis
Sanger
KEGG
EBI
• SNP Analysis for Radiotherapy Effectiveness Study
• SNP Analysis for Regional Diseases in SE Asia
NCBI
HPC Challenge SC2002
Nucleotide Annotation Workflows
Interactive
Editor &
Visualisation
Download
sequence
from
Reference
Server
Real-time
sequencing
in London
Inter
Pro
SMART
KEGG
EMBL
NCBI
SWISS
PROT
TIGR
SNP
GO
Save to
Distributed
Annotation
Server
Distributed data
and computation
1800 clicks, 500 Web accesses, 200 copy/pastes and 3 weeks of work
in 1 workflow and a few seconds of execution
Execute
distributed
annotation
workflow
D-Net Based SARS Research
Environment in China
Genbank
Homology search against
viral genome DB
Homology search
against protein DB
Annotation using
Artemis and GenSense
Annotation using
Artemis and
GenSense
Predicted
genes
Gene prediction
Exon prediction
Key word
search
Splice site prediction
GeneSense
Ontology
SARS genome
sequence
Multiple sequence
alignment
D-Net:
Integration,
interpretation,
and discovery
Relationship
between SARS
and other virus
Phylogenetic analysis
Mutual regions
identification
Homology search
against motif DB
Protein localization
site prediction
Protein interaction
prediction
Relationship
between SARS virus
and human receptors
prediction
Immunogenetics
Microarray analysis
SARS patients
diagnosis
Classification and
secondary structure
prediction
Epidemiological analysis
Bibliographic databases
Bibliographic databases
A Real Demonstration :
SARS Genome Study

Data:
• Work performed on 33 samples of the SARS virus, sequenced from Chinese patients
• Combined with publicly available data from the National Center for Biotechnology Information (NCBI)
Software: Discovery Net
• Distributed analysis platform for scientific research
• Integration of data, applications and procedures
• Designed on top of the Grid service environment
Analysis: Deeper understanding of the mutation patterns of the SARS virus
• Examining the variability of the virus at both the genomic and proteomic levels
• Providing insight into the significance of changes in the nucleic structure of the virus
Results: SARS-CoV Evolution
Mutation Analysis Workflows
Geohazard Modelling
 Study on Remote Sensing data mining for
Geohazard Prediction: landslides in the Three Gorges
Dam/Reservoir Region in China
 Software development: automatic imagery co-registration and rectification function (running on
the Grid)
Grid-based Geo-hazard
Data Mining
Grid-based HPC Computation

Workflow to Co-ordinate
Computation
Automatically co-register a stack
of imagery layers at high precision
and speed.
Data
Warehousing
&
Modelling
Co-registration
&
geo-rectification
Image features
extraction
Grid-based Data Access and Integration
Cluster &
classification
Environmental Modelling
Application:

High Throughput Sensor
Technology:
 GUSTO (Generic Ultra Violet
Sensor Technology)


Measures SO2, NO, NO2,O3 & Benzene
at ppb levels simultaneously
Geared for networking of multiple
GUSTO units in a GRID Infrastructure
•Correlate With Multi-Species Data Set (i.e. The
variation of one pollutant with respect to
another)
•Correlate With Medical Databases in Order to
Identify Patterns in Acute Respiratory
Occurrences
•Correlate with Traffic Data
Power of Rich Data Model:
Workflow for Multi-Modality
Analysis
Data mining
Text mining
Spectrum data mining
chemical/sequence
data model
Knowledge discovery from
massive data processing for
earthquake study
Remote Sensing Applications in Discovery Net
Sub-pixel Shift Measurement Using Multi-temporal
Optical Images
Dr J. G. Liu & Dr J. Ma
Remote Sensing Unit
Department of Earth Science and
Engineering
Imageodesy – two
algorithms
By accurately comparing optical images taken before and after a
land deformation event (e.g. an earthquake), it is possible to measure
the horizontal shift caused by the event at sub-pixel level. This
technique is called imageodesy.
It is complementary to the well-established radar interferometry
technique, which is sensitive to vertical deformation.
The major technical challenge of high-accuracy imageodesy is the
huge computing demand of massive data processing.
We have developed software that implements imageodesy on an MPI
parallel computer, based on the conventional template normalised
cross-correlation algorithm. We have also developed a new
algorithm and software based on phase correlation.
Normalised cross-correlation
(NCC) template algorithm
[Flowchart: the “before” and “after” images are each read as a data set; a
search window and comparing windows are set; a decision box tests for a
significant correlation coefficient (N / Y); the labelled outputs are
Delta X (shown twice) and the correlation coefficient.]
Runs on a remotely accessed MPI UNIX parallel computer over a fast
network, through the Kensington interface. Slow but highly accurate:
24 processors take 10 hours for one scene of Landsat-7 ETM+ Pan
imagery. The algorithm also runs on the Grid.
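The template matching step described above can be sketched as follows. This is a minimal illustration, not the project's MPI software: the function names, window sizes and threshold are hypothetical, NumPy is assumed, and the search is restricted to whole-pixel shifts (sub-pixel accuracy would need interpolation around the best match).

```python
import numpy as np


def ncc(a, b):
    """Normalised cross-correlation coefficient of two equal-sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)


def ncc_shift(before, after, row, col, half_win=4, search=3, min_coeff=0.7):
    """Whole-pixel shift (dx, dy) of the patch centred at (row, col).

    half_win, search and min_coeff are illustrative values only.
    Returns (dx, dy, coeff), or None if no match is significant.
    The caller must keep (row, col) far enough from the image border.
    """
    tmpl = before[row - half_win:row + half_win + 1,
                  col - half_win:col + half_win + 1]
    best = (0, 0, -1.0)
    # Slide the comparing window over the search window in the "after" image.
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = row + dy, col + dx
            cand = after[r - half_win:r + half_win + 1,
                         c - half_win:c + half_win + 1]
            coeff = ncc(tmpl, cand)
            if coeff > best[2]:
                best = (dx, dy, coeff)
    return best if best[2] >= min_coeff else None
```

Every template position requires a full scan of its search window, which is consistent with the slide's remark that the method is slow on a full scene.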
Phase correlation algorithm
[Flowchart: the “before” and “after” images each pass through Reading Dataset,
Hamming Windowing and FFTW; the two results feed Phase Correlation and then
Inverse FFTW, giving Delta X, Delta Y and the phase correlation coefficient.]
Phase Correlation Imageodesy algorithm scheme
We have implemented the phase correlation algorithm for imageodesy,
running on both UNIX and PC. This new algorithm has the potential
to increase the processing speed significantly.
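The scheme above (Hamming windowing, forward FFT, phase correlation, inverse FFT) can be sketched as below. This is an illustrative sketch under stated assumptions: it uses NumPy's FFT rather than the FFTW library named on the slide, the function name is hypothetical, and it recovers whole-pixel shifts only (a sub-pixel estimate would need interpolation around the correlation peak).

```python
import numpy as np


def phase_correlation_shift(before, after):
    """Estimate the whole-pixel shift (dx, dy) between two same-sized images."""
    rows, cols = before.shape
    # A 2-D Hamming window reduces edge effects before the FFT.
    win = np.outer(np.hamming(rows), np.hamming(cols))
    f_before = np.fft.fft2(before * win)
    f_after = np.fft.fft2(after * win)
    # Normalised cross-power spectrum: keep the phase, discard the magnitude.
    cross = f_after * np.conj(f_before)
    cross /= np.maximum(np.abs(cross), 1e-12)
    corr = np.real(np.fft.ifft2(cross))
    peak_row, peak_col = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks beyond the half-way point correspond to negative shifts.
    dy = peak_row if peak_row <= rows // 2 else peak_row - rows
    dx = peak_col if peak_col <= cols // 2 else peak_col - cols
    return int(dx), int(dy), float(corr.max())
```

Two forward FFTs and one inverse FFT replace the exhaustive window search of template matching, which is one way to read the slide's claim of a significant speed-up.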
Post imageodesy processing
The imageodesy software we developed for both algorithms is so sensitive
that it reveals the stripe patterns of sensor scan compensation or CCD
calibration (a type of system error) alongside the true land deformation
signal. Heavy smoothing filters and correlation coefficient criteria have
been used to remove the effects of these system error patterns from the final
raster and vector presentations.
The first scientific result:
Kunlun earthquake
An Ms 8.1 earthquake occurred on 14 Nov 2001
at 09:26:18 UTC in the East Kunlun Mountains
along the Kusai Lake segment of the Kunlun fault.
A surface rupture zone 400 km long, trending E-W to WNW-ESE,
was produced, and the left-lateral strike-slip was as large as 16.3 m
according to field observations made by Chinese scientists
immediately after the earthquake.
The earthquake fault zone
from space
Horizontal shift image produced by imageodesy. The image shows average
left-lateral shifts of 5-10 m along the fault zone.
Vector presentation
Summary
e-Science and Knowledge
Discovery from Remote Sensing
Applications

e-Science provides massive processing power and resources: MPI and the Grid
enable high-resolution imageodesy of very large datasets. Remote processing
power and remote/local dataset.

Interaction between e-Science and new application algorithm development can
deliver tremendous processing power on the desktop: the phase correlation
algorithm enables imageodesy processing on a single PC or UNIX machine at very
high speed. This makes real-time remote imageodesy possible. Local processing
power and remote/local dataset.

Knowledge discovery: the earthquake study result presented is an original
contribution to science, of high scientific value. It provides quantitative
data on the tectonic movement of the region. The discovery would not have been
possible without the massive processing power of e-Science and the research on
algorithm design and software development supported by the UK e-Science Pilot
Project: Discovery Net.
High Throughput Applications
Knowledge Discovery from Real-time High
Throughput Devices in Discovery Net
John Hassard
Monitoring Herbicide Resistant Oilseed
Rape Farm Scale trials
Proteomics -Distributed Protein analysis
Monitoring Urban Air Pollution in
Multipoint in Real-Time
Real-time multi-site multi-species (NO, NO2, O3, SO2,
Benzene) analysis
Demonstration
of GM Crop
location and
adjacent gene
product transfer
Herbicide Resistant Oilseed Rape (HTOSR) crops have been
trialled in farm-scale experiments in the UK and North
America.
•Genetic modification creates plants that are resistant to
the herbicide Liberty®.
•The transgene is based on the pat gene encoding the protein
phosphinothricin acetyltransferase (PAT) from Alcaligenes faecalis.
•It is 197 amino acids long, which gives a molecular weight of 21213 Da.
•This type of enzyme is quite common in GM crops and could potentially spread to
a range of plant species.
•The location is near Cambridge and was used for GM trials in 1999.
•100 fields chosen at random around the test site are each sampled at 100 points.
Plant tissue is taken and analysed in high throughput Protein Analysis Systems.
F/81
Field sampling by double line intercept,
100 points in each field.
Sample field F/81: mark at 5.8 m on the X axis and 3.9 m on the Y
axis. The plants nearest to the intersect
points are sampled.
Plant samples are analysed by a fast Protein Analysis System, looking for an additional protein spike.
Of the 100 sample points, some may show the presence of the GM protein product.
Non-GM
In every field a degree of GM presence can be ascertained, expressed as
a % and displayed as a pie-chart on the map
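The per-field percentage described above amounts to a simple ratio. A minimal sketch, assuming each sample point has already been classified as GM-positive or not (the function name and the example values are hypothetical, not trial data):

```python
def gm_presence_percent(sample_is_gm):
    """Percentage of sample points in one field that show the GM protein."""
    samples = list(sample_is_gm)
    if not samples:
        raise ValueError("a field needs at least one sample point")
    positives = sum(1 for flagged in samples if flagged)
    return 100.0 * positives / len(samples)


# Hypothetical field: 3 of 100 sample points show the additional protein spike.
field_flags = [True] * 3 + [False] * 97
print(gm_presence_percent(field_flags))
```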
GM
[Map: 100 field sites, labelled F/n, chosen at random around the HT-OSR test site.]
Map legend: OilSeed Rape; Crop B; Crop C; Crop D; Woodland; Recreation; Un-Sampled; Herbicide Resistant OilSeed Rape (HTOSR)
[Map panels: the sampled fields grouped by land use (OilSeed Rape, Crop B, Crop C, Crop D, Woodland and Recreation); each panel also marks the Herbicide Resistant OilSeed Rape (HTOSR) field F/1.]
[Map: field counts by land use, including HT OilSeed Rape (1) and OilSeed Rape (31); the other categories shown are Crop B, Crop C, Crop D, Woodland, Recreation and Un-Sampled, with the counts 13, 14, 18 and 24.]
Further correlations possible with prevalent weather
conditions and other factors such as
•Fauna
•Elevation and physical geography
•Physical features-rivers, woods etc.
http://homepage.ntlworld.com/richard.barker4/archive/index/0312/winddir1.htm
• Physical geography
[Map: OilSeed Rape fields around the Herbicide Resistant OilSeed Rape (HTOSR) field F/1, overlaid with the 50 m contour.]
Other factors affecting
gene transfer.
• Physical geography
•Meteorological
factors
Humidity
Rainfall
Windspeed
Temperature
Scenario 1
Random, low-level transfer of GM pollen and seed
Scenario 2
Heavier transfer to multiple crops, with some influence due to prevailing
winds
GM Containment Monitoring
Protein or DNA
GM Detection
Database
Govt. Agency
HTS Data
Acquisition
Distributed
Systems
Geographical
Location &
Factors
Response
Advice
BioTech
Decision
Making
Release
Alert
Sampling times
• Some GM crops are only fertile at set times.
•Sampling of surrounding crops can be set to these or other schedules
•Transfer usually occurs via pollen or seed from cross-pollinated plants
(i) Mode(s) of reproduction
Autogamous and allogamous reproduction: oilseed rape is a crop capable of both self-pollination (approx. 70%)
and cross-pollination (approx. 30%). The pollen, which is heavy and sticky, can be transferred from plant to
plant through physical contact between neighbouring plants and by wind and insects.
(ii) Specific factors affecting reproduction, if any
Temperature (insect visits), humidity (pollen viability) and wind.
Pollinating insects, in particular honeybees (Apis mellifera) and bumblebees (Bombus sp.) play a major role in
B. napus pollination.
(iii) Generation time: Between 6 and 12 months.
•Very large amounts of data will be generated from GM
sites all over the UK
•10 000 data points for a single sample time
Based on the real data sets from Food Standards Agency experiments in 2003
-Typical run time = 10-35 min. The raw data size scales linearly with run time, so as the Protein
Analysis System gets faster the data set will shrink.
Typical raw data size = 16 MBytes (compressed)
Typical processed data size = 600 kBytes
For 100 samples: 1.6 GBytes raw + 60 MBytes processed
For 100 fields * 100 samples: 160 GBytes + 6 GBytes
For 100 sites * 100 fields * 100 samples: 16 TBytes + 600 GBytes
All this data must be collected, collated, analysed and passed to the
DEFRA/FSA for timely response to problems
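The volume estimates above follow from multiplying the per-sample sizes by the number of samples. A short sketch of that arithmetic, assuming decimal units (1 MByte = 1000 kBytes), which is what the slide's figures imply; the function names are illustrative:

```python
RAW_KB_PER_SAMPLE = 16_000      # 16 MBytes (compressed) per sample
PROCESSED_KB_PER_SAMPLE = 600   # 600 kBytes per sample


def data_volume_kb(samples, fields=1, sites=1):
    """Return (raw_kB, processed_kB) for one sampling time."""
    n = samples * fields * sites
    return n * RAW_KB_PER_SAMPLE, n * PROCESSED_KB_PER_SAMPLE


def human(kb):
    """Format a size in kBytes using decimal (x1000) unit steps."""
    value = float(kb)
    for unit in ("kBytes", "MBytes", "GBytes"):
        if value < 1000:
            return f"{value:g} {unit}"
        value /= 1000
    return f"{value:g} TBytes"


for label, args in [("100 samples", (100,)),
                    ("100 fields * 100 samples", (100, 100)),
                    ("100 sites * 100 fields * 100 samples", (100, 100, 100))]:
    raw, processed = data_volume_kb(*args)
    print(f"{label}: {human(raw)} raw + {human(processed)} processed")
```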
Sensor Specification
The GUSTO Project - Update
(Generic UV Sensors Technologies & Observations)
• High throughput open path spectrometer system
• Robust algorithm for pollutant concentration retrievals
• Measures SO2, NO, NO2, O3 & Benzene to ppb levels every
few seconds
• Geared for networking of multiple GUSTO units within a
GRID Infrastructure
• Can support Remote Sensing data for (contour) mapping of
pollutants
www.gusto-systems.com
Instrument Set-up
3-60m variable open path
Networking of Multiple GUSTO Units
GUSTO
unit 1
Wireless
connectivity
Monitoring and
control software
Sensor registry &
control service
GUSTO
unit 2
GUSTO
unit 3
SensorML
HTTP,
SOAP,
GSI
Data upload
service
HTTP,
SOAP,
GSI
Warehouse
Data access
service
Archived
weather data
GUSTO
unit 4
Archived
health data
GRID Infrastructure
Public access
Web visualizer
Visualisation and
Data Mining
(KDE)
Scenario – Case Study
High Levels of Local Aggregated Asthma Events…!
Increased Burden on Local Health Service
Monitor Region for NOx With Multiple Gusto Units
Use an e-Science Platform to Visualise & Correlate Data Flow
Above example will produce roughly 200kbs flow rate & 15 GB per day of raw data
Visualisation – Static
Average
Larger circles
correspond to
greater pollution
levels
Gusto Systems
are approx.
100m apart
Above: Pollution levels are high near the school and a main
road, which are in close proximity. Ambient levels (away from
the main road) are much lower.
Real time GIS-Visualiser
Visualisation – School in Detail
9:00 AM Rush hour + school run: levels high near school
3:30 PM School run only: levels high near school
5:00 PM Rush hour only: levels low near school
Conclusion: Pollution levels near school are significantly
affected by school run…
Mapping & Analysis Workflow
Pollution ‘hotspots’
become easily
identifiable
Analysis workflows may be
deployed as web-service
Automated Data
Clustering also
possible
Correlation
– data mining
Correlate with multi-species data set (i.e. the variation of one
pollutant with respect to another)
Correlate with regional & national air quality databases
Correlate with regional & national weather data, e.g. temperature,
humidity, wind speed & direction etc
Correlate with other remote sensing data for modelling &
predictive research
Correlate with medical databases in order to identify patterns in
acute respiratory occurrences
Effective data mining & warehousing in a multi- platform
environment is vital and future work will focus on this aspect
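As one illustration of the multi-species correlation listed above, the variation of one pollutant with respect to another can be summarised by a Pearson correlation coefficient. This is a minimal sketch assuming NumPy and two equally sampled series; the function name and the readings are hypothetical, not GUSTO measurements.

```python
import numpy as np


def pollutant_correlation(series_a, series_b):
    """Pearson correlation between two equally sampled concentration series."""
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    if a.shape != b.shape or a.size < 2:
        raise ValueError("series must have the same length (at least 2)")
    return float(np.corrcoef(a, b)[0, 1])


# Hypothetical ppb readings taken at the same instants.
no2_ppb = [10.0, 20.0, 30.0, 40.0]
o3_ppb = [40.0, 30.0, 20.0, 10.0]
print(pollutant_correlation(no2_ppb, o3_ppb))
```

The same function could be applied to a pollutant series and a weather or traffic series, provided both are resampled onto common time stamps first.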
Summary
• Data – Retrieved (pollutant) concentration values from atmospheric spectra
• Information – Real-time variations in pollutant conc. with (x,y,z,t), leading to imaging and mapping capabilities
• Knowledge – Increased understanding of pollution sources and the dynamic processes influencing their behaviour in the atmosphere. Correlative studies between local pollutant levels and related health effects
• Benefit – Real-time Decision making based on best available knowledge leading to effective urban pollution control

• Data – Retrieved anomalous upregulated protein concentration values from samples
• Information – Real-time variations in degree of up-regulation with (x,y,z,t), leading to imaging and mapping capabilities
• Knowledge – Increased understanding of transgenic samples and the dynamic processes influencing their behaviour in the geosphere. Correlative studies between local transgenic protein levels and related health effects
• Benefit – Real-time Decision making based on best available knowledge leading to effective transgenic protein surveillance, knowledge of propagation and pathways

DiscoveryNet is a proven and highly effective platform in cross-disciplinary discovery science, allowing key new insights in real-time multi-source data analysis. It is a highly efficient and adaptable platform for a wide range of cutting-edge disciplines.