Grids for Chemical Informatics
Chemistry, IU Bloomington, Oct. 21 2005
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org

Why are Grids Important
Grids are important for Chemistry because they support key functionalities that grow in importance as we are deluged with data from instruments and simulations
Grids provide information access, storage and management
Grids manage multiple simulations with different defining parameters
Grids allow complex workflows with data flowing between filters
Grids define models for portals
Grids are built on top of commodity web service technology with broad industry support – the next generation information technology
Grids are used in multiple NIH and other life science/chemistry projects across the world (BIRN, caBIG, myGrid, Comb-e-Chem)

Internet Scale Distributed Services
Grids use Internet technology and are distinguished by managing or organizing sets of network connected resources
  • Classic Web allows independent one-to-one access to individual resources
  • Grids integrate together and manage multiple Internet-connected resources: people, sensors, computers, data systems
Organization can be explicit as in
  • TeraGrid, which federates many supercomputers;
  • Deep Web Technologies IR Grid, which federates multiple data resources;
  • CrisisGrid, which federates first responders, commanders, sensors, GIS, (Tsunami) simulations, science/public data
Organization can be implicit as in Internet resources such as curated databases and simulation resources that “harmonize a community”

Different Visions of the Grid
Grid just refers to the technologies
  • Or Grids represent the full system/Applications
DoD’s vision of Network Centric Computing can be considered a Grid (linking sensors, warfighters, commanders, backend resources) and they are building the GiG (Global Information Grid)
Utility Computing or X-on-demand (X=data,
computer ..) is a major computer industry interest in Grids, and this is a key part of enterprise or campus Grids
e-Science or Cyberinfrastructure are virtual organization Grids supporting global distributed science (note that sensors, instruments and people are all distributed)
Skype (Kazaa) VOIP system is a Peer-to-peer Grid (and VRVS/GlobalMMCS-like Internet A/V conferencing are Collaboration Grids)
Commercial 3G cell phones and the DoD ad-hoc network initiative are forming mobile Grids

Types of Computing Grids
Running “Pleasingly Parallel Jobs” as in United Devices, Entropia (Desktop Grid) “cycle stealing systems”
Can be managed (“inside” the enterprise as in Condor) or more informal (as in SETI@Home)
Computing-on-demand in Industry where jobs spawned are perhaps very large (SAP, Oracle …)
Support distributed file systems as in Legion (Avaki), Globus with (web-enhanced) UNIX programming paradigm
  • Particle Physics will run some 30,000 simultaneous jobs
Distributed Simulation HLA/RTI style Grids
Linking Supercomputers as in TeraGrid
Pipelined applications linking data/instruments, compute, visualization
Seamless Access where Grid portals allow one to choose one of multiple resources with a common interface
Parallel Computing typically NOT suited for a Grid (latency)

Analysis and Visualization
ADVANCED VISUALIZATION, ANALYSIS
Large Disks

Old Style Metacomputing Grid
Figure labels: COMPUTATIONAL RESOURCES; LARGE-SCALE DATABASES; Large Scale Parallel Computers
Original: Spread a single large problem over multiple supercomputers
Now-1: Control multiple smallish jobs, each on independent computers
Now-2: Choose which of a few supercomputers to use

Towards an International Compute Grid Infrastructure
Figure labels: US TeraGrid (SDSC, NCSA, PSC); UK NGS (Leeds, Manchester, Oxford, RAL); UCL; Starlight (Chicago); Netherlight (Amsterdam); UKLight; SC05
Legend: Computation; Steering clients; Network PoP; Service Registry
All sites connected by production network (not all shown)
Local laptops in Seattle and UK

Information/Knowledge Grids
Distributed data sources, 10’s to 1000’s of them (instruments, file systems, curated databases …)
Data Deluge: 1 (now) to 100’s petabytes/year (2012)
  • Moore’s law for Sensors
Possible filters assigned dynamically (on-demand)
  • Run image processing algorithm on telescope image
  • Run gene sequencing algorithm on compiled data
Needs decision support front end with “what-if” simulations
Metadata (provenance) critical to annotate data
Integrate across experiments as in multi-wavelength astronomy
Data Deluge comes from pixels/year available

Data Deluged Science
Now particle physics will get 100 petabytes from CERN using around 30,000 CPUs simultaneously 24x7
Exponential growth in data; compare to:
  • The Bible = 5 Megabytes
  • Annual refereed papers = 1 Terabyte
  • Library of Congress = 20 Terabytes
  • Internet Archive (1996 – 2002) = 100 Terabytes
Weather, climate, solid earth (EarthScope)
Bioinformatics curated databases (Biocomplexity only 1000’s of data points at present)
Virtual Observatory and SkyServer in Astronomy
Environmental Sensor nets
In the past, the HPCC community worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new science and new ways of computing
Data assimilation was not central to HPCC
DoE ASCI was set up because they didn’t want test data!
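
To put the size comparisons on the Data Deluged Science slide in proportion, the short sketch below divides the quoted 100 petabytes by each reference collection. It assumes decimal units (1 MB = 10^6 bytes and so on); the function name `ratios` and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Sizes quoted on the slide, converted to bytes.
# Assumption: decimal (SI) units, which the slide does not specify.
MB, TB, PB = 10**6, 10**12, 10**15

references = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}
cern_volume = 100 * PB  # particle physics data volume quoted on the slide


def ratios(volume, refs):
    """Return how many copies of each reference collection fit in `volume`."""
    return {name: volume // size for name, size in refs.items()}


for name, factor in ratios(cern_volume, references).items():
    print(f"100 PB is about {factor:,} x {name}")
```

Under these assumptions the CERN volume is roughly five thousand Libraries of Congress, which is the scale argument the slide is making.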
Virtual Observatory Astronomy Grid
Integrate Experiments
Figure labels: Radio; Far-Infrared; Visible; Dust Map; Visible + X-ray; Galaxy Density Map

International Virtual Observatory Alliance
• Reached international agreements on Astronomical Data Query Language, VOTable 1.1, UCD 1+, Resource Metadata Schema
• Image Access Protocol, Spectral Access Protocol and Spectral Data Model, Space-Time Coordinates definitions and schema
• Interoperable registries by Jan 2005 (NVO, AstroGrid, AVO, JVO) using OAI publishing and harvesting
• So each Community of Interest builds data AND service standards that build on GS-* and WS-*

• Imminent ‘deluge’ of data
• Highly heterogeneous
• Highly complex and inter-related
• Convergence of data and literature archives
myGrid Project

The Williams Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence

Figure labels: Web services; Programs; Computational resources; service logic (BPEL, Java, .NET); Databases; resources; Humans; Devices; SOAP messages; message processing; SOAP and WSDL
<env:Envelope>
  <env:Header> ... </env:Header>
  <env:Body> ... </env:Body>
</env:Envelope>
Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on SOA (service oriented architecture) principles.
Web Services interact by exchanging messages in SOAP format.
The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.

A typical Web Service
In principle, services can be in any language (Fortran .. Java .. Perl ..
Python) and the interfaces can be method calls, Java RMI messages, CGI Web invocations, or totally compiled away (inlining)
The simplest implementations involve XML messages (SOAP) and programs written in net-friendly languages like Java and Python
Figure labels: Portal Service; Security; Web Services; WSDL interfaces; Payment; Credit Card; Catalog; Warehouse; Shipping control

Two-level Programming I
• The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model
• We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies
  – C++, Java or Fortran Monte Carlo module
  – Data streaming from a sensor or satellite
  – Specialized (JDBC) database access
• Such services accept and produce data from users, files and databases
• The Grid is built by coordinating such services, assuming we have solved the problem of programming the service
Figure labels: Service; Data

Two-level Programming II
The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams
Familiar from use of UNIX Shell, Perl or Python scripts to produce real applications from core programs
Such interpretative environments are the single processor analog of Grid Programming
Some projects like GrADS from Rice University are looking at integration between service and composition levels, but the dominant effort looks at each level separately
Figure labels: Service1; Service2; Service3; Service4

Figure labels: Repositories; Federated Databases; Database; Sensors; Streaming Data; Field Trip Data; Sensor Grid; Database Grid; Research; SERVOGrid; Education; Compute Grid; Data Filter Services; Research Simulations
Figure labels: GIS Discovery Grid Services; Customization Services; From Research to Education; Analysis and Visualization; Portal; Education Grid; Computer Farm
Grid of Grids: Research Grid and Education Grid

SERVOGrid Requirements
Seamless Access to Data repositories and large scale computers
Integration of multiple data sources including sensors, databases, file systems with analysis system
  • Including filtered OGSA-DAI (Grid database access)
Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid
Portals with component model for user interfaces and web control of all capabilities
Collaboration to support world-wide work
Basic Grid tools: workflow and notification
NOT metacomputing

SERVOGrid Portal Screen Shots

[Figure: layered Grid-of-Grids architecture, Earthquake Grid and DoD NCOW Grid]
Figure labels: Earthquake Grid; DoD NCOW Grid; C2 (JBI CEE etc.); NCOW-IS Services; CoI Specific … Grids/Services; Earthquake Data & Simulation Service; ServoIS; Information Grid; Compute Grid; GIS Grid; Sensor Grid; Core Low Level Grid Services; Physical Network
Numbered services: 1: Management; 2: Security; 3: Messaging; 4: Discovery; 5: Mediation; 6: Collaboration Grid; 7: Portals; 8: Data Access/Storage; 9: Application Services; 10: Policy (ECS); 11: Metadata
n: Service refers to core services identified by DoD
CoI: Community of Interest
GIS: Geographical Information System

[Figure: layered Grid-of-Grids architecture, BioInformatics Grid and Chemical Informatics Grid]
Figure labels: BioInformatics Grid; Chemical Informatics Grid; …; HTS Tools; Quantum Calculations; CIS; …; Domain Specific Grids/Services; Sequencing Tools; Biocomplexity Simulations; BIS; Compute Grid; MIS Grid; Instrument Grid; Information Grid; Core Low Level Grid Services; Physical Network
Numbered services: 1: Management; 2: Security; 3: Messaging; 4: Discovery; 5: Workflow; 6: Collaboration Grid; 7: Portals; 8: Data Access/Storage; 9: Application Services; 10: Policy; 11: Metadata
M(B,C)IS: Molecular (Bio, Chem) Information System

GIS Grid with WMS, WFS, data sources and GML
<gml:featureMember>
  <fault>
    <name> Northridge2 </name>
    <segment> Northridge2 </segment>
    <author> Wald D.
J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
Figure labels: WMS; Client; WFS Server; GetFeature; FeatureCollection; SQL Query; Railroad [a-b]; River [a-d]; Bridge [1-5]; Highway [12-18]; Railroads; Interstate Highways; Rivers; Bridges
GML becomes CML, CellML, SBML

Electric Power and Natural Gas data from LANL Interdependent Critical Infrastructure Simulations
Map controls: Zoom-in; Zoom-out; FeatureInfo mode; Measure distance mode; Clear Distance; Drag and Drop mode; Refresh to initial map

Integrating Archived Web Feature Services and Google Maps
Google maps can be integrated with Web Feature Service Archives to filter and browse seismic records.

What is Happening?
Grid ideas are being developed in (at least) four communities
  • Web Service – W3C, OASIS, (DMTF)
  • Grid Forum (High Performance Computing, e-Science)
  • Enterprise Grid Alliance (Commercial “Grid Forum” with a near term focus)
Service Standards are being debated
Grid Operational Infrastructure is being deployed
Grid Architecture and core software being developed
  • Apache has several important projects as do academia; large and small companies
Particular System Services are being developed “centrally” – OGSA or GS-* framework for this in GGF; WS-* for OASIS/W3C/Microsoft-IBM
Lots of fields are setting domain specific standards and building domain specific services
USA started but now Europe is probably in the lead and Asia will soon catch USA if momentum (roughly zero for USA) continues

The Grid and Web Service Institutional Hierarchy
4: Application or Community of Interest Specific Services such as “Run BLAST” or “Look at Houses for sale”
3: Generally Useful Services and Features such as “Access a Database” or “Submit a Job” or “Manage Cluster” or “Support a Portal” or “Collaborative
Visualization”
  OGSA GS-* and some WS-* (GGF/W3C/…)
2: System Services and Features: Handlers like WS-RM, Security, Programming Models like BPEL, or Registries like UDDI
  WS-* from OASIS/W3C/Industry
1: Container and Run Time (Hosting) Environment
  Apache Axis, .NET etc.
Must set standards to get interoperability

Location of software for Grid Projects in Community Grids Laboratory
http://www.naradabrokering.org provides Web service (and JMS) compliant distributed publish-subscribe messaging (software overlay network)
http://www.globalmmcs.org is a service oriented (Grid) collaboration environment (audio-video conferencing)
http://www.crisisgrid.org is an OGC (Open Geospatial Consortium) compliant GIS (Geographical Information System) and Sensor Grid (with POLIS center)
http://www.opengrids.org has WS-Context, Extended UDDI etc.
The work is still in progress but NaradaBrokering is quite mature
All software is open source and freely available

Project Goals
Establish Requirements from stakeholders
  • Research
  • Pharmaceutical Industry
  • Government
Consider educational implications
  • e-Science v Bio/Chem/Molecular Informatics
Consider other national and international projects to ensure we either lead or use best practice
Design a Grid architecture and staged implementation
Start pilot projects led by Chemistry/Chemical Informatics
Evaluate and iterate
Design and implement ?(Chem, Life Science, Science, Molecular) Informatics educational program that will attract students
Write winning center grant in 2006-7

Web Services Introduction
• What are “Web Services”?
  – A distributed invocation system built on Grid computing
    • Independent of platform and programming language
    • Built on existing Web standards
  – A service oriented architecture with
    • Interfaces based on Internet protocols
    • Messages in XML (except for binary data attachments)

Web Services Introduction
• A web-based architecture providing for interoperability among resources
  – Centralized service registry
  – Solves problems associated with finding, using, and combining online resources
• Employ standard Internet protocols for:
  – Communication with resources
  – Automated discovery using centralized registries
• Communicate with devices, people, and each other with the protocols and computer languages

Service Oriented Architecture (SOA)
• Goal is to achieve loose coupling among interacting software agents
• Define service: a unit of work done by a service provider to achieve desired end results for a service consumer
• Both provider and consumer are roles played by software agents on behalf of their owners.

How does SOA work?
• Two architectural constraints are employed
  – Small set of simple and ubiquitous interfaces to all participating software agents
  – Descriptive messages constrained by an extensible schema delivered through the interfaces

Web Services Architectures
• Individual services are registered globally
  – Broken down into individual services with inputs and outputs specified
• Services are published
• Services are requested
• Open registry, publishing, and requesting

Service-Oriented Architecture
• From Curcin et al.
DDT, 2005, 10(12), 867

Web Services for Science
• Invisible Services, Semantic Web, and Grid
• Easy-to-use tools for any scientist
• High throughput, resource intensive computing done for low cost/resources
• Shared community
  – Collaborations between labs and fields
  – Shared data
  – Shared tools

e-Science and the Grid 1
• e-Science: Major UK Program
  – global collaboration in key areas of science and the next generation of infrastructure that will enable it
    • reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams
    • total investment of some £200M over the five-year period from 2001 to 2006
• CyberInfrastructure: the analogous US initiative
• Grid Technology: supports e-Science & Cyberinfrastructure

Basic Architectures: Servlets/CGI and Web Services
Figure labels: Browser; GUI Client; HTTP GET/POST; Web Server; WSDL; SOAP; JDBC; DB or MPI Appl.

Importance of Web Services
• Building a true science community
• Enabling interoperability between tools and the integration of data
• Less time coding, more time for science
• Change the way scientists work by achieving new levels of integration

When To Use Web Services?
• Applications do not have severe restrictions on reliability and speed.
• Two or more organizations need to cooperate.
  – One needs to write an application that uses another’s service.
• Services can be upgraded independently of clients.
• Services can be easily expressed with simple request/response semantics and simple state.

Web Services Benefits
• Web services provide a clean separation between a capability and its user interface.
• Increase in productivity
• Increase in flexibility
• Rapid return on investment
• Integration across multiple applications

Web Services Advantages
• Output in human- and computer-readable formats
• I/O formats based on standard Internet protocols
• Resources accessible server to server allow automated I/O
• Integration based on specific services: you select services or data needed without downloading the entire data set

Web Services Advantages
• Description protocols provide details of service provided and interface components
• Semantic Web standards increase efficiency
• Use a central registry and standardized description of services
• Quality and status of the information is dynamically available

Web Services Drawbacks
• Based on new technologies
• Time and commitment required to learn
• Standards still in a state of rapid flux
• Issues with quality of data (and for chemistry, quantity of open data), security, and privacy

Components of Web Services
• Protocols
  – SOAP
  – WSDL
  – UDDI
• XML as a basis for the protocols
• Ontologies
  – OWL: Web Ontology Language
• Semantic Web

Components of the Semantic Web for Chemistry
• XML – eXtensible Markup Language
• RDF – Resource Description Framework
• RSS – Rich Site Summary
• Dublin Core – allows metadata-based newsfeeds
• OWL – for ontologies
• BPEL4WS – for workflow and web services
  – Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192–3203.

SOAP: Simple Object Access Protocol
• Flexible protocol to communicate information between server and server or client and server using XML
• Supports Remote Procedure Calls
• Allows layers (security, authentication, transactions) over the basic SOAP elements

WSDL: Web Services Description Language
• Describes a service’s interface to clients
• Services register themselves with Web Services
• WSDL describes how to contact and interact with services
  – I/O, operations and messages to aid interaction with client

WSDL Overview
• An XML-based Interface Definition Language.
  – You can define the APIs for all of your services in WSDL.
• WSDL docs are broken into five major parts:
  – Data definitions (in XML) for custom types
  – Abstract message definitions (request, response)
  – Organization of messages into “ports” and “operations” (classes and methods).
  – Protocol bindings (to SOAP, for example)
  – Service point locations (URLs)
• Some interesting features
  – A single WSDL document can describe several versions of an interface.
  – A single WSDL doc can describe several related services.

UDDI: Universal Description, Discovery, and Integration
• Provides ways for clients and services to interact with other services
• Uses XML
• Defines the means of access, e.g.,
  – URL
  – E-Mail
• Defines services hosted by an entity
• Business-oriented tags
• Uses SOAP for communicating

XML: eXtensible Markup Language
• Allows definitions of types of documents
• Tags are used to specify components of documents
• Allows specification of namespaces to differentiate between identical tag names
• Tag names do not provide semantics other than simple hierarchical relations

XML Overview
• A language for building languages
• Basic rules: be well formed and be valid
• Particular XML “dialects” are defined by XML schemas.
  – XML itself is defined by its own schema.
• Extensible via namespaces
• Many non-Web services dialects
  – RDF, SVG, GML, CML, XForms, XHTML
• Many basic tools available: parsers, XPath and XQuery for searching/querying, etc.

XML and Web services
• XML lends itself to distributed computing:
  – It’s just a data description.
  – Platform, programming language independent
• Web Services Description Language (WSDL)
  – Describes how to invoke a service
  – Can bind to SOAP, other protocols for actual invocation
• Simple Object Access Protocol (SOAP)
  – Wire protocol extension for conveying RPC calls
  – Can be carried over HTTP, SMTP

OWL: Web Ontology Language
• Builds on RDF and RDFS and adds a means for richer descriptions of properties and classes
  – Disjoint classes
  – Cardinality of classes
  – Characteristics of relations, like symmetry

Standards for Web Services
• Business Process Execution Language for Web Services (BPEL4WS)
• Web Ontology Language for Services (OWL-S)
• Web Service Modeling Ontology (WSMO)

Standards Setting Boards
• OASIS: Organization for the Advancement of Structured Information Standards
  – ebXML: e-business XML
  – UDDI: Universal Description, Discovery and Integration
• Global Grid Forum
  – community of users, developers, and vendors leading the global standardization effort for grid computing

Standards Setting Boards
• W3C: World Wide Web Consortium
  – OWL: Web Ontology Language
  – RDF/RDFS: Resource Description Framework/Schema
  – SOAP: Simple Object Access Protocol
  – URI/URL/URN: Uniform Resource Identifier/Locator/Name
  – WSDL: Web Services Description Language
  – XML: eXtensible Markup Language

SWWS: Semantic Web-Enabled Web Services
• Main objectives:
  – Provide a comprehensive Web Service description framework
  – Define a Web Service discovery framework
  – Provide a scalable Web Service mediation middleware
• A program of the European Commission to run 2002-2005
  – http://swws.semanticweb.org

Web Services Integration Projects: Biosciences
• myGrid
  – http://www.mygrid.org.uk/
• BIOPIPE
  – http://biopipe.org/
• BioMOBY
  – http://biomoby.org/

Web Services for Chemistry: Problems
• Performance and scalability
• Proprietary data
• Competition from high-performance desktop applications
  -- Geoff Hutchison, it’s a puzzle blog, 2005-01-05
• ALSO:
  – Lack of a substantial body of trustworthy
Open Access databases
  – Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)

Missing Ingredients in Chemistry
• Chemical communities to assemble Open Access databases
  – Well-defined quality assurance procedures performed by distributed peer-review systems
  – Software underlying the databases needs to be open source.

Chemistry Databases on the Web
• Marc Nicklaus lists 37 databases as of October 2001
  – Must have structure searching and at least 100 molecules
  – http://cactus.nci.nih.gov/ncidb2/chem_www.html
• SoaringBear’s List has 15 databases
  – http://geocities.com/soaringbear/biomed/chem.html

Institutional Repositories
• NARSTO Quality Systems Science Center
  – http://cdiac.esd.ornl.gov/programs/NARSTO/
  – Pollutant species in the troposphere over North America
  – Part of the Carbon Dioxide Information Analysis Center at ORNL
  – NARSTO Data and Information Sharing Tool
    • http://mercury.ornl.gov/narsto/

Public Data Repositories
• Developmental Therapeutics Program/NCI
  – Some assay data for download
  – Structures for over 200,000 compounds
    • http://dtp.nci.nih.gov/docs/dtp_search.html
• ZINC and other screening databases
• NIST computational chemistry database
• Environmental fate and exposure databases

Other Public Repositories 1
• ChemExper Chemical Directory
  – > 200,000 substances; > 10,000 IR spectra
  – http://chemexper.com/
• HIC-Up; Hetero-Compound Identification Centre – Uppsala
  – 5384 substances as of 1/15/05
  – http://xray.bmc.uu.se/hicup/
• Chemicals with Pharmaceutical Activity; a 3D Structural Database
  – 400 3D structures
  – http://www.chem.ox.ac.uk/mom/chemical-database/

Other Public Repositories 2
• Cheminformatics.org
  – 41 data sets in 9 categories as of 8/18/05
  – http://www.cheminformatics.org/
• WebReactions
  – http://webreactions.net/

Other Public Repositories 3
• MolTable
  – http://www.moltable.org/
• MatWeb Materials Property Data
  – http://www.matweb.com/index.asp?ckck=1
• Spectral Database for Organic
Compounds (SDBS)
  – Over 32,000 compounds
  – Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR
  – http://www.aist.go.jp/RIODB/SDBS/cgi-bin/cre_index.cgi
• NMRShiftDB (Christoph Steinbeck)
  – 14,753 structures as of 8/19/05
  – Features peer-reviewed submission of data sets
  – http://www.nmrshiftdb.org/

Other Public Repositories: Commercial Teasers
• FTIRsearch.com (Thermo Electron)
  – Demo file of 575 spectra from 87,000 in the full database
  – https://ftirsearch.com/default3.htm
• ChemACX
  – 30 of >350 suppliers’ catalog data
  – http://chemacx.cambridgesoft.com/chemacx/index.asp
• Sunset Molecular Discovery, LLC
  – Wombat (World of Molecular BioAcTivity)
    • 117,007 entries with over 230,000 biological activities
  – Wombat PK
    • Database for Clinical Pharmacokinetics: 643 substances with 4668 measurements
  – Three sample files from Wombat containing 341 Histamine-1 receptor antagonists
  – http://www.sunsetmolecular.com/

BlueObelisk.org
• A group of chemists, programmers, and informaticians working collaboratively on projects such as:
  – Chemistry Development Kit (CDK)
  – JChemPaint
  – Jmol
  – JUMBO
  – NMRShiftDB
  – Octet
  – Open Babel
  – QSAR
  – World Wide Molecular Matrix (WWMM)

Indiana University Existing Projects
• System for the Integration of Bioinformatics Services (SIBIOS)
  – http://sibios.engr.iupui.edu
• PlatCom: A Platform for Computational Comparative Genomics
  – http://bio.informatics.indiana.edu/sunkim/Platcom/
• Reciprocal Net
  – http://www.reciprocalnet.org/index.html

Indiana University Planned Projects
• Design of a Grid-based distributed data architecture
• Development of tools for HTS data analysis and virtual screening
• Database for quantum mechanical simulation data
• Chemical prototype projects
  – Novel routes to enzymatic reaction mechanisms
  – Mechanism-based drug design
  – Data-inquiry-based development of new methods in natural product synthesis

Web Services Future
• Depends on
  – Adoption of standards
  – Incorporation of WS in current and newly developed applications
  –
Security, privacy, quality of data issues
  – Development of WS tools and resources for e-Science
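
As a closing illustration of the SOAP message structure described in the Web Services slides (an Envelope containing a Header and a Body, carrying an XML payload), here is a minimal sketch using only the Python standard library. It assumes the SOAP 1.2 envelope namespace; the application namespace `urn:example:chem`, the operation name `getMolecularWeight`, and the `smiles` parameter are hypothetical names chosen for the example and do not refer to any real service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
APP_NS = "urn:example:chem"  # hypothetical application namespace (assumption)


def build_envelope(operation, params):
    """Build a SOAP envelope whose Body holds one operation element."""
    ET.register_namespace("env", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")  # left empty in this sketch
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_operation(xml_text):
    """Extract the operation name and its arguments from an envelope."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    args = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, args


message = build_envelope("getMolecularWeight", {"smiles": "CCO"})
print(message)
print(read_operation(message))
```

A real deployment would normally generate such messages from the service's WSDL with a SOAP toolkit rather than by hand; the sketch only shows that the wire format is plain, namespace-qualified XML that any language can produce and parse.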