Distributed Data Mining in Discovery Net Dr. Moustafa Ghanem Department of Computing Imperial College London 1. 2. 3. 4. 5. What is Discovery Net Distributed Data Mining for Compute Intensive Tasks Distributed Data Mining for Sensor Grids Knowledge Discovery from Naturally Distributed Data Sources What Do Scientists Really Want? 1. What is Discovery Net What is Discovery Net? Funding : One of the eight UK national e-science Pilot Projects funded by EPSRC (£2.2M) Start Oct 2001, End March 2005 Goal :Construct the World’s first Infrastructure for Global Knowledge Discovery Services Key Technologies: Open Service Computing High Throughput Devices and Real Time Data Mining Real Time Data Integration & Information Structuring Cross Domain Knowledge Discovery and Management Discovery Workflow and Discovery Planning Discovery Net Applications Life Sciences High throughput genomics and proteomics Distributed Databases and Applications Environmental Modelling High throughput dispersed air sensing technology Sensor Grids Real time geo-hazard modelling Earthquake modelling through satellite imagery High performance Distributed Computation 10 9 8 7 6 5 4 3 2 1 A B C D E F G H I J K L M N Discovery Net Architecture DPML Web/Grid Services OGSA D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities D-Net Middleware: Provides services and execution logic for distributed knowledge discovery and access to distributed resources and services High Performance Communication Protocol (GridFTP, DSTP..) Grid Infrastructure (GSI) Goal: Plug & Play • • • Data Sources, Analysis Components & Knowledge Discovery Processes Computation & Data Resources: Distributed databases, compute servers and scientific devices. Discovery Net Data Mining Components Generic Data Mining Classification, Clustering, Associations, .. Unstructured-Data Mining Text Mining, Image Mining Domain-specific Mining Bioinformatics, Cheminformatics, .. 2. Distribution of Compute Intensive Tasks a. Distributed Data Mining for Geo-hazard Prediction Grid-based Geo-hazard Data Mining Grid-based HPC Computation Automatically co-register a stack of imagery layers at high precision and speed. Data Warehousing & Modelling Co-registration & geo-rectification Grid-based Data Access and Integration Workflow to Coordinate Grid Computation Image features extraction Cluster & classification Normalised cross-correlation (NCC) template algorithm Image “before” Image “after” Reading Data set Reading Data set Setting search window Setting comparing window Setting comparing window Significant correlation coefficient N Y Delta X Delta X Correlation coefficient Operating on a remotely accessed MPI UNIX parallel computer through fast network with DNet interface. Slow but high accuracy: 24 processors 10 hours for one scene of Landsat-7 ETM+ Pan imagery data. The algorithm also run on GRID. 2. Distribution of Compute Intensive Tasks b. Distributed Clustering Workflows for Distributed Data Clustering 3. Distributed Mining over Sensor Grid Data Distributed Spatial Data Mining for Air Pollution Modelling Sensor Specification The GUSTO Project - Update (Generic UV Sensors Technologies & Observations) • • • • • High throughput open path spectrometer system Robust algorithm for pollutant concentration retrievals Measures SO2, NO, NO2,O3 & Benzene to ppb levels every few seconds Geared for networking of multiple GUSTO units within a GRID Infrastructure Can support Remote Sensing data for (contour) mapping of pollutants www.gusto-systems.com Networking of Multiple GUSTO Units GUSTO unit 1 Wireless connectivity Monitoring and control software Sensor registry & control service GUSTO unit 2 GUSTO unit 3 SensorML HTTP, SOAP, GSI Data upload service HTTP, SOAP, GSI Warehouse Data access service Archived weather data GUSTO unit 4 Archived health data GRID Infrastructure www.gusto-systems.com Public access Web visualizer Visualisation and Data Mining Pollution analysis 4. Knowledge Discovery from Naturally Distributed Data Sources Distributed Data Mining in Life Sciences Distributed Data Mining for Life Sciences secondary structure tertiary structure polymorphism patient records epidemiology expression patterns physiology sequences alignments ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT receptors signals pathways linkage maps cytogenetic maps physical maps Information Integration Gene Expression Warehouse ExPASy SwissProt PDB ExPASy Enzyme OMIM Disease Enzyme Protein Affy Fragment LocusLink Known Gene MGD Sequence Metabolite SNP SPAD Sequence Cluster Genbank NCBI dbSNP NMR Pathway KEGG UniGene Given a collection of microarray generated gene expression data, what kind of questions the users wish to pose. Design an integration schema? From Data Integration to Knowledge Unification In Silico Experiment D-World I-World K-World Life Science Application: SC2002 HPC Challenge D-Net based Global Collaborative Real- Time Genome Annotation Identify High Throughput Sequencers Organism Chromosomes Identify Genes Gene markers Nucleotide-level Annotation Organism’s DNA Regulatory Regions Segmental Duplication Literature References tRNAs, rRNAs Non-translated RNAs Repetitive Elements SNP Variations EMBL NCBI TIGR SNP genscan blast grail Repeat Masker E-PCR genscan ….. Identify Protein-level Annotation Proteins Classify into Protein Families Functional Characteisation Homologues Domain 3-D Structure Fold Prediction Secondary structure Literature References ….. Inter Pro Inter Pro SMART SWISS PROT blast PFAM Motif Search predator DSC Relate Process-level Annotation Pathway Maps Cell Cycle Metabolism Drugs Biological Process….. Cell death Literature References 3D-PSSM Embryogenesis ….. GO CSNDB KEGG GK Ontologies AmiGO GeneMaps virtual chip GenNav 15 DBs 21 Applications Genome Annotation HPC Challenge SC2002 Nucleotide Annotation Workflows Interactive Editor & Visualisation Download sequence from Reference Server Real-time sequencing in London Inter Pro SMART KEGG EMBL NCBI SWISS PROT TIGR SNP GO Save to Distributed Annotation Server Distributed data and computation 1800 clicks 500 Web access 200 copy/paste 3 weeks work in 1 workflow and few second execution Execute distributed annotation workflow Discovery Net in Action: China SARS Virtual Lab Genbank Homology search against viral genome DB Homology search against protein DB Annotation using Artemis and GenSense Annotation using Artemis and GenSense Predicted genes Gene prediction Exon prediction Key word search Splice site prediction GeneSense Ontology Multiple sequence alignment D-Net: Integration, interpretation, and discovery Relationship between SARS and other virus Phylogenetic analysis Immunogenetics Mutual regions identification Microarray analysis Epidemiological analysis SARS patients diagnosis Homology search against motif DB Protein localization site prediction Protein interaction prediction Relationship between SARS virus and human receptors prediction Classification and secondary structure prediction Bibliographic databases Bibliographic databases Discovery Net in Action: SARS Virus Mutation Analysis 5. What do Scientist Really Want? Does it really work? Towards Compositional Grid Services Native MPI Condor-G Web Service Resource Mapping Web Wrapper Sun Grid Engine Service Browsing OGSA-service Oralce 10g Unicore Workflow Execution A compositional GRID Workflow Warehousing Workflow Authoring Composing services Workflow Management Collaborative Knowledge Management Service Abstraction Workflow Deployment: Grid Service and Portal Discovery Net Service Composition Full Workflow Executing Protein Annotation Workflow Deployment of Node Deploying Protein Annotation Workflow Executing Deployed Service Locating & Executing Deployed Service from Discovery Net Workflow Provenance Workflow Warehousing Discovery Net Snapshot Scientific Information Scientific Discovery In Real Time Real Time Data Integration Discovery Services Literature Service Workflow Databases Operational Data Dynamic Application Integration Integrative Knowledge Management Using Distributed Resources Images Instrument Data