Informatics for Life: Fighting Natural Disasters with e-Science
Moustafa Ghanem, Jian Guo Liu, John Hassard
Imperial College London

Discovery Net Project
Principal Investigator: Yike Guo
Investigators: John Hassard, Jian Guo Liu, John Darlington, Tony Cass, Dan Ruckert
Team: Moustafa Ghanem, Salman Al Sairafi, Peter Au, Vasa Curcin, Filippia Sofia Emmanouil, Stuart Hassard, Matt Howard, Steffi Klier, Jinming Ma, Roland Martin, Anton Olyenikov, Michelle Osmond, Kruti Patel, Fiona Pereira, Mark Richards, Anthony Rowe, Jameel Syed, Huy Vu, Patrick Wendel, Yong Zhang

Discovery Net
Goal: Constructing the World’s First Infrastructure for Global Wide Knowledge Discovery Services
Key Features:
• Open Service Computing
• Allow Institutions to Integrate, Manage and Utilise their Intellectual Property
• High Throughput Devices and Real Time Data Mining
• Real Time Data Integration & Information Structuring
• Cross Domain Knowledge Discovery and Management
• Discovery Workflow and Discovery Planning in Real Time
• Service Workflow
• Allow Scientists to Construct, Share and Execute Complex Knowledge Discovery Procedures & Services
[Diagram labels: Literature, Scientific Discovery, Real Time Data Integration, Discovery Services, Dynamic Application Integration, Integrative Knowledge Management, Databases, Operational Data, Using GRID Resources, Images]

Key Technologies:
[Diagram labels: Scientific Information, Instrument Data, Workflow Technology, Native MPI, Condor-G, OGSA-service, Web Service, Resource Mapping, Web Wrapper, Sun Grid Engine, Oracle 10g, Unicore, Workflow Execution, A Compositional GRID, Workflow Warehousing, Workflow Authoring, Composing Services, Workflow Management, Collaborative Knowledge Management, Service Abstraction, Workflow Deployment: Grid Service and Portal]

Discovery Net Applications
• Life Sciences: High throughput genomics and proteomics
• Environmental Monitoring: High throughput dispersed air sensing technology
• Real time Geo-hazard Modelling: Earthquake modelling through satellite imagery

Progress and Achievements (Year 1: 2002, Year 2: 2003, Year 3: 2004)
• Infrastructure: DNet Architecture, Workflow System Prototype, InfoGrid, Service Integration, Service Deployment, Workflow Warehousing, Oracle 10g, OGSA-DAI, Publish as Public Services
• High Performance Computing Challenge: 1,800 clicks, 350 cut and pastes, 200 database accesses and 250 accesses to computation services are done in one workflow and one click
• Applications Case Studies: Fighting SARS, Geohazard Modelling, Environmental Monitoring
• Recognized for outstanding text mining ability in an international competition organized by the ACM (Association for Computing Machinery)
• Components/Services: Data Analysis, Text Mining, GIS, Image Mining, Chemistry / Biochemistry
• International Collaborations
• Application Testbeds: Life Science (Year 1); Life Science, Environmental Monitoring (Year 2); Life Science, Environmental Monitoring, Geohazard (Year 3)

Life Science Applications
Real-time Collaborative Genome Annotation
• Sanger Centre Collaboration, SC2002 Top Prize
• Shanghai Bioinformation Institute: SARS Genome Annotation
Large-scale Integrative Functional Genomics
• Three National Micro-arrays Centres Worldwide
• Three International Pharmaceutical Companies
• Imperial College Projects (BAIR £5.5 M, Wellcome Trust)
Real-time High Throughput Chemoinformatics
• Collaboration with 2 International Pharmaceutical Companies
• Novel projects integrating Bioinformatics & Chemoinformatics
Large-scale Genotyping Data Analysis
• SNP Analysis for Radiotherapy Effectiveness Study
• SNP Analysis for Regional Diseases in SE Asia
[Diagram labels: Imperial College, Sanger, KEGG, EBI, NCBI]

HPC Challenge SC2002: Nucleotide Annotation Workflows
• Interactive Editor & Visualisation
• Download sequence from Reference Server
• Real-time sequencing in London
• Annotation sources: InterPro, SMART, KEGG, EMBL, NCBI, SWISS-PROT, TIGR, SNP, GO
• Save to Distributed Annotation Server
• Distributed data and computation
• 1800 clicks, 500 Web accesses and 200 copy/pastes (3 weeks of work) in 1 workflow and a few seconds of execution
• Execute distributed
annotation workflow

D-Net Based SARS Research Environment in China
D-Net: Integration, interpretation, and discovery
[Diagram labels: SARS genome sequence; Genbank; Homology search against viral genome DB; Homology search against protein DB; Homology search against motif DB; Annotation using Artemis and GenSense; Predicted genes; Gene prediction; Exon prediction; Key word search; Splice site prediction; GeneSense Ontology; Multiple sequence alignment; Relationship between SARS and other virus; Phylogenetic analysis; Mutual regions identification; Protein localization site prediction; Protein interaction prediction; Relationship between SARS virus and human receptors prediction; Immunogenetics; Microarray analysis; SARS patients diagnosis; Classification and secondary structure prediction; Epidemiological analysis; Bibliographic databases]

A Real Demonstration: SARS Genome Study
Data:
• Work performed on 33 samples of the SARS virus, sequenced from Chinese patients
• Combined with publicly available data from the National Center for Biotechnology Information (NCBI)
Software: Discovery Net
• Distributed analysis platform for scientific research
• Integration of data, applications and procedures
• Designed on top of the Grid service environment
Analysis:
• Deeper understanding of the mutation patterns of the SARS virus
• Examining the variability of the virus at both the genomic and the proteomic level
• Providing full insight into the significance of changes in the nucleic structure of the virus
Results: SARS-CoV Evolution, Mutation Analysis Workflows

Geohazard Modelling
• Study on Remote Sensing data mining for Geohazard Prediction: landslides in the 3-Gorge Dam/Reservoir Region in China
• Software development: automatic imagery co-registration and rectification function (running on the Grid)
Grid-based Geo-hazard Data Mining
• Grid-based HPC Computation
• Workflow to Co-ordinate Computation
• Automatically co-register a stack of imagery layers at high precision and speed.
Data Warehousing & Modelling
[Diagram labels: Co-registration & geo-rectification, Image features extraction, Grid-based Data Access and Integration, Cluster & classification]

Environmental Modelling Application: High Throughput Sensor Technology
GUSTO (Generic Ultra Violet Sensor Technology)
• Measures SO2, NO, NO2, O3 & Benzene at ppb levels simultaneously
• Geared for networking of multiple GUSTO units in a GRID Infrastructure
• Correlate with multi-species data set (i.e. the variation of one pollutant with respect to another)
• Correlate with medical databases in order to identify patterns in acute respiratory occurrences
• Correlate with traffic data

Power of Rich Data Model: Workflow for Multi-Modality Analysis
[Diagram labels: Data mining, Text mining, Spectrum data mining, chemical/sequence data model]

Remote Sensing Applications in Discovery Net
Knowledge discovery from massive data processing for earthquake study
Sub-pixel Shift Measurement Using Multi-temporal Optical Images
Dr J. G. Liu & Dr J. Ma, Remote Sensing Unit, Department of Earth Science and Engineering

Imageodesy – two algorithms
By accurately comparing optical images taken before and after a land deformation event (e.g. an earthquake), it is possible to measure the horizontal shift caused by the event at sub-pixel level. This technique is called Imageodesy. It is complementary to the well-established radar interferometry technique, which is sensitive to vertical deformation.
The major technical challenge of high accuracy imageodesy is the huge computing demand of massive data processing.
We have developed software that implements imageodesy on an MPI parallel processor, based on the conventional template normalised cross-correlation algorithm.
Furthermore, we developed a new algorithm and software based on the phase correlation algorithm.
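The conventional template approach mentioned above can be illustrated with a minimal sketch. It is not the project's MPI software: it searches integer-pixel shifts only (the sub-pixel refinement the slides refer to is omitted), and the function names, window sizes and the 0.8 significance threshold are all illustrative assumptions.

```python
import math
import random


def ncc(a, b):
    """Normalised cross-correlation coefficient of two equal-length pixel lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def window(img, top, left, size):
    """Flatten the size x size window whose top-left pixel is (top, left)."""
    return [img[r][c] for r in range(top, top + size)
            for c in range(left, left + size)]


def ncc_shift(before, after, top, left, size, search, threshold=0.8):
    """Find the integer shift (dy, dx) of one template inside a search window.

    Returns (dy, dx, coefficient), or None when the best coefficient is
    below the significance threshold (0.8 is an assumed, illustrative value).
    """
    template = window(before, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + size > len(after) or c + size > len(after[0]):
                continue  # comparing window would fall outside the image
            score = ncc(template, window(after, r, c, size))
            if best is None or score > best[2]:
                best = (dy, dx, score)
    if best is None or best[2] < threshold:
        return None
    return best


# Synthetic example: "after" is "before" moved 1 pixel down and 2 pixels right.
rng = random.Random(0)
before = [[rng.random() for _ in range(12)] for _ in range(12)]
after = [[before[r - 1][c - 2] if r >= 1 and c >= 2 else 0.0
          for c in range(12)] for r in range(12)]
result = ncc_shift(before, after, top=4, left=4, size=4, search=2)
print(result)
```

Every template position is searched independently of the others, which is what makes this kind of algorithm straightforward to distribute over many processors.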
Normalised cross-correlation (NCC) template algorithm
[Flow chart: Image “before” and Image “after”; Reading data set; Setting search window; Setting comparing window; Significant correlation coefficient? (Y/N); outputs: Delta X, Delta Y, Correlation coefficient]
• Operates on a remotely accessed MPI UNIX parallel computer through a fast network with the Kensington interface.
• Slow but high accuracy: 24 processors take 10 hours for one scene of Landsat-7 ETM+ Pan imagery data.
• The algorithm also runs on the GRID.

Phase correlation algorithm
[Flow chart: Phase Correlation Imageodesy algorithm scheme. Image “before” and Image “after”; Reading dataset; Hamming windowing; FFTW; Phase correlation; Inverse FFTW; outputs: Delta X, Delta Y, Phase correlation coefficient]
We have implemented the phase correlation algorithm for Imageodesy on both UNIX and PC. This new algorithm has the potential to increase the processing speed significantly.

Post imageodesy processing
The imageodesy software that we developed for both algorithms is so sensitive that, besides the true land deformation information, it reveals the stripe patterns of sensor scan compensation or CCD calibration (a type of system error). Massive smoothing filters and correlation coefficient criteria have been used to remove the effects of these system error patterns from the final raster and vector presentations.

The first scientific result: Kunlun earthquake
An Ms 8.1 earthquake occurred on 14 Nov 2001 at 09:26:18 UTC in the East Kunlun Mountains along the Kusai Lake segment of the Kunlun fault. A surface rupture zone about 400 km long, striking E-W to WNW-ESE, was produced, and the left-lateral strike-slip was as large as 16.3 m according to field observations conducted by Chinese scientists immediately after the earthquake.
[Figure: The earthquake fault zone from space]
[Figure: Horizontal shift image produced by imageodesy. The image shows average 5-10 m left-lateral shifts along the fault zone.]
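The phase correlation scheme in the flow chart earlier in this section (Hamming windowing, forward transform, phase correlation, inverse transform) can be sketched as follows. This is a minimal illustration rather than the authors' software: a naive DFT written with the standard library stands in for FFTW, only the integer-pixel peak is located (sub-pixel peak fitting is left out), and all function names are assumptions.

```python
import cmath
import math
import random


def dft(seq, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for FFTW)."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(seq))
        out.append(acc / n if inverse else acc)
    return out


def dft2(img, inverse=False):
    """2-D transform: transform every row, then every column."""
    rows = [dft(row, inverse) for row in img]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def hamming(n):
    """Hamming window coefficients for a window of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def phase_correlation(before, after, use_window=True):
    """Return (dy, dx, peak): integer shift of 'after' relative to 'before'."""
    rows, cols = len(before), len(before[0])
    if use_window:
        wr, wc = hamming(rows), hamming(cols)
        before = [[before[r][c] * wr[r] * wc[c] for c in range(cols)]
                  for r in range(rows)]
        after = [[after[r][c] * wr[r] * wc[c] for c in range(cols)]
                 for r in range(rows)]
    f_before, f_after = dft2(before), dft2(after)
    # Normalised cross-power spectrum: keep only the phase difference.
    cross = []
    for r in range(rows):
        line = []
        for c in range(cols):
            p = f_after[r][c] * f_before[r][c].conjugate()
            line.append(p / abs(p) if abs(p) > 1e-12 else 0j)
        cross.append(line)
    surface = dft2(cross, inverse=True)
    peak, py, px = max((surface[r][c].real, r, c)
                       for r in range(rows) for c in range(cols))
    # Peaks past the midpoint correspond to negative shifts (wrap-around).
    dy = py - rows if py > rows // 2 else py
    dx = px - cols if px > cols // 2 else px
    return dy, dx, peak


# Synthetic check: circularly shift a random 8 x 8 image by (2, 3) pixels.
rng = random.Random(1)
n = 8
before = [[rng.random() for _ in range(n)] for _ in range(n)]
after = [[before[(r - 2) % n][(c - 3) % n] for c in range(n)] for r in range(n)]
dy, dx, peak = phase_correlation(before, after, use_window=False)
print(dy, dx, round(peak, 6))
```

The synthetic check disables the window because a circular shift is exactly periodic; on real image pairs the Hamming window is what suppresses edge effects. The whole displacement estimate for a window comes from two forward transforms and one inverse transform rather than an exhaustive search, which is consistent with the speed-up the slides attribute to this algorithm.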
[Figure: Vector presentation]

Summary
e-Science and Knowledge Discovery from Remote Sensing Applications
• e-Science provides massive processing power and resources: MPI and the Grid enable high resolution imageodesy of very large datasets. Remote processing power and remote/local dataset.
• Interaction between e-Science and new application algorithm development can deliver tremendous processing power on the desktop: the phase correlation algorithm enables imageodesy processing on a single PC or UNIX machine at very high speed. This makes real time remote imageodesy possible. Local processing power and remote/local dataset.
• Knowledge discovery: the result of the earthquake study presented is an original contribution to science and of high scientific value. It provides quantitative data on the tectonic movement of the region. The discovery would not have been possible without the massive processing power of e-Science and the research on algorithm design and software development supported by the UK e-Science Pilot Project: Discovery Net.

High Throughput Applications
Knowledge Discovery from Real-time High Throughput Devices in Discovery Net
John Hassard
• Monitoring Herbicide Resistant Oilseed Rape Farm Scale trials
• Proteomics: Distributed Protein analysis
• Monitoring Urban Air Pollution in Multipoint in Real-Time
• Real-time multi-site multi-species (NO, NO2, O3, SO2, Benzene) analysis

Demonstration of GM Crop location and adjacent gene product transfer
Herbicide Resistant Oilseed Rape (HTOSR) crops have been trialed in Farm Scale size experiments in the UK and North America.
• Genetic Modification creates plants that are resistant to the herbicide Liberty®.
• The transgene is based on the pat gene encoding the protein phosphinothricin acetyltransferase (PAT) from Alcaligenes faecalis.
• It is 197 amino acids long, which comes to a molecular weight of 21213 Da.
• This type of enzyme is quite common in GM and potentially might spread to a range of plant species.
• The location is near Cambridge and was used for GM trials in 1999.
• 100 fields chosen at random around the test site are sampled at 100 points. Plant tissue is taken and analysed in high throughput Protein Analysis Systems.

Field sampling by double line intercept, 100 points in each field
• Sample field F/81: mark at 5.8 m in the X axis and 3.9 m in the Y axis.
• The plants nearest to the intersect points are sampled.
• Plant samples are analysed by a fast Protein Analysis System looking for an additional protein spike.
• Of the 100 sample points, some may show the presence of the GM protein product.
• In every field a degree of GM presence can be ascertained, expressed as a % and displayed as a pie-chart (GM / Non-GM) on the map.

[Figure: map of the 100 field sites (labelled F/0 to F/102) chosen at random around the HT-OSR test site F/1. Legend: OilSeed Rape, Crop B, Crop C, Crop D, Woodland, Recreation, Un-Sampled, Herbicide Resistant OilSeed Rape (HTOSR) F/1.]
[Figure: the same fields grouped by land use. Legible labels: HT OilSeed Rape 1; OilSeed Rape 31; Crop B 14; Crop C; Crop D 18; Woodland; Recreation; Un-Sampled; further values 13 and 24.]

Further correlations are possible with prevalent weather conditions and other factors such as
• Fauna
• Elevation and physical geography
• Physical features: rivers, woods etc.
http://homepage.ntlworld.com/richard.barker4/archive/index/0312/winddir1.htm

Other factors affecting gene transfer
• Physical geography
[Figure: field map with 50 m contours around the HTOSR test site F/1]
• Meteorological factors: Humidity, Rainfall, Windspeed, Temperature
Scenario 1: Random, low level transfer of GM pollen and seed
Scenario 2: Heavier transfer to multiple crops, with some influence due to prevailing winds

GM Containment Monitoring
[Diagram labels: Protein or DNA GM Detection, HTS Data Acquisition, Distributed Systems, Database, Geographical Location & Factors, Govt. Agency, BioTech, Decision Making, Response, Advice, Release Alert]

Sampling times
• Some GM crops are only fertile at set times.
• Sampling of surrounding crops can be set to these or other schedules
• Usually transfer occurs by pollen or seed from cross-pollinated plants
(i) Mode(s) of reproduction: Autogamous and allogamous reproduction. Oilseed rape is a crop capable of both self-pollination (approx. 70%) and cross-pollination (approx. 30%). The pollen, which is heavy and sticky, can be transferred from plant to plant through physical contact between neighbouring plants and by wind and insects.
(ii) Specific factors affecting reproduction, if any: Temperature (insect visits), humidity (pollen viability) and wind. Pollinating insects, in particular honeybees (Apis mellifera) and bumblebees (Bombus sp.), play a major role in B. napus pollination.
(iii) Generation time: Between 6 and 12 months.

• Very large amounts of data will be generated from GM sites all over the UK
• 10 000 data points for a single sample time
Based on the real data sets from Food Standards Agency experiments in 2003:
Typical run time = 10-35 min. The raw data scales linearly with run time, so as the Protein Analysis System gets faster the data set size will go down.
Typical raw data size = 16 MBytes (compressed)
Typical processed data size = 600 kBytes
For 100 samples: 1.6 GBytes raw + 60 MBytes processed
For 100 fields * 100 samples: 160 GBytes raw + 6 GBytes processed
For 100 sites * 100 fields * 100 samples: 16 TBytes raw + 600 GBytes processed
All this data must be collected, collated, analysed and passed to DEFRA/FSA for a timely response to problems.

Sensor Specification
The GUSTO Project - Update (Generic UV Sensors Technologies & Observations)
• High throughput open path spectrometer system
• Robust algorithm for pollutant concentration retrievals
• Measures SO2, NO, NO2, O3 & Benzene to ppb levels every few seconds
• Geared for networking of multiple GUSTO units within a GRID Infrastructure
• Can support Remote Sensing data for (contour) mapping of pollutants
www.gusto-systems.com

Instrument Set-up
3-60 m variable open path

Networking of Multiple GUSTO Units
[Diagram labels: GUSTO units 1 to 4; Wireless connectivity; Monitoring and control software; Sensor registry & control service (SensorML; HTTP, SOAP, GSI); Data upload service (HTTP, SOAP, GSI); Warehouse; Data access service; Archived weather data; Archived health data; GRID Infrastructure; Public access; Web visualizer; Visualisation and Data Mining (KDE)]

Scenario – Case Study
• High Levels of Local Aggregated Asthma Events…!
• Increased Burden on Local Health Service
• Monitor Region for NOx With Multiple Gusto Units
• Use an e-Science Platform to Visualise & Correlate Data Flow
The above example will produce roughly 200kbs flow rate & 15 GB per day of raw data.

Visualisation – Static Average
• Larger circles correspond to greater pollution levels
• Gusto Systems are approx. 100 m apart
• Pollution levels are high near the school and a main road, which are in close proximity. Ambient levels (away from the main road) are much lower.
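The data volumes quoted earlier in this section, for the GM sampling campaign and for the GUSTO case study, can be checked with a few lines of arithmetic. The sketch below uses decimal units (1 GB = 10^9 bytes) and, as an assumption, reads the quoted "200kbs" as 200 kilobytes per second; the function names are illustrative.

```python
RAW_BYTES_PER_SAMPLE = 16_000_000       # 16 MBytes (compressed), from the slides
PROCESSED_BYTES_PER_SAMPLE = 600_000    # 600 kBytes, from the slides
SECONDS_PER_DAY = 24 * 60 * 60


def campaign_volume(samples):
    """Return (raw_bytes, processed_bytes) for a number of protein samples."""
    return (samples * RAW_BYTES_PER_SAMPLE,
            samples * PROCESSED_BYTES_PER_SAMPLE)


def daily_volume(bytes_per_second):
    """Bytes accumulated in one day at a constant flow rate."""
    return bytes_per_second * SECONDS_PER_DAY


def human(n_bytes):
    """Format a byte count using decimal prefixes."""
    value = float(n_bytes)
    for unit in ("bytes", "kB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:g} {unit}"
        value /= 1000


for label, samples in [("100 samples", 100),
                       ("100 fields * 100 samples", 100 * 100),
                       ("100 sites * 100 fields * 100 samples", 100 ** 3)]:
    raw, processed = campaign_volume(samples)
    print(f"{label}: {human(raw)} raw + {human(processed)} processed")

# Sensor network: 200 kB/s is an assumed reading of the quoted flow rate.
print("GUSTO case study per day:", human(daily_volume(200_000)))
```

Under that reading the sensor network accumulates about 17 GB per day, the same order as the "roughly 15 GB per day" quoted on the slide; reading the rate as kilobits per second would give only about 2 GB per day.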
Real time GIS-Visualiser
Visualisation – School in Detail
• 9:00 AM, rush hour + school run: levels high near school
• 3:30 PM, school run only: levels high near school
• 5:00 PM, rush hour only: levels low near school
Conclusion: Pollution levels near the school are significantly affected by the school run…

Mapping & Analysis Workflow
• Pollution ‘hotspots’ become easily identifiable
• Analysis workflows may be deployed as web-services
• Automated Data Clustering also possible

Correlation – data mining
• Correlate with multi-species data set (i.e. the variation of one pollutant with respect to another)
• Correlate with regional & national air quality databases
• Correlate with regional & national weather data, e.g. temperature, humidity, wind speed & direction etc
• Correlate with other remote sensing data for modelling & predictive research
• Correlate with medical databases in order to identify patterns in acute respiratory occurrences
Effective data mining & warehousing in a multi-platform environment is vital and future work will focus on this aspect.

Summary
Urban air pollution (GUSTO):
• Data – Retrieved (pollutant) concentration values from atmospheric spectra
• Information – Real-time variations in pollutant conc. with (x,y,z,t), leading to imaging and mapping capabilities
• Knowledge – Increased understanding of pollution sources and the dynamic processes influencing their behaviour in the atmosphere. Correlative studies between local pollutant levels and related health effects
• Benefit – Real-time decision making based on best available knowledge, leading to effective urban pollution control
GM crop monitoring:
• Data – Retrieved anomalous upregulated protein concentration values from samples
• Information – Real-time variations in degree of up-regulation with (x,y,z,t), leading to imaging and mapping capabilities
• Knowledge – Increased understanding of transgenic samples and the dynamic processes influencing their behaviour in the geosphere. Correlative studies between local transgenic protein levels and related health effects
• Benefit – Real-time decision making based on best available knowledge, leading to effective transgenic protein surveillance, knowledge of propagation and pathways
DiscoveryNet is a proven and highly effective platform in cross-disciplinary discovery science, allowing key new insights in real-time multi-source data analysis. It is a highly efficient and adaptable platform for a wide range of cutting-edge disciplines.