Intersection of Semantic Web and Life Sciences
Kei Cheung
Yale Center for Medical Informatics
Genomics and Bioinformatics (MBB 452a), November 2, 2005

Outline
• Introduction
• Overview of RDF and LSID
• Semantic web applications
  – Connotea
  – Piggy Bank
  – YeastHub

Two scientific/technological endeavors that have impacted the world greatly in the past 15 years
• Human Genome Project (HGP)
  – International collaboration that began in 1990 and was completed in 2003
  – Understand the blueprint of life (the moon landing of the nineties)
  – Sequence the entire human genome
• World Wide Web (WWW)
  – Born in 1989/1990 at CERN (developed by Tim Berners-Lee)
  – Revolutionized information access and sharing over the Internet (Gutenberg’s printing press)
  – Web browsers (e.g., IE, Netscape, Firefox)

Relationship between HGP and WWW
• HGP transformed the life sciences into an information science: the large amounts of data generated need to be stored and analyzed
  – GenBank, EMBL, and DDBJ have recently reached a milestone of 100 billion bases from > 165,000 organisms
  – PubMed has > 300,000 articles from > 150 life sciences journals
• WWW has become the most popular medium for life scientists to distribute, access, share, and integrate different types of biological data over the Internet
  – As of 2005, there are 719 publicly available databases listed in the NAR molecular biology database compilation

Spider-Man: Spidey science gets a genetic makeover
http://www.genomenewsnetwork.org/articles/05_02/spiderman.php
Spider-Man (Tim Berners-Lee) – Weaving the Web

Semantic Web
• "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
  - Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001
• It provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries
• It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming

Semantic Web for Life Sciences (TBL, Bio-IT World Conference, May 2005)
• “… also the people involved in the Semantic Web pushing it along are also excited about getting involved in the life sciences – it’s one of those areas that affect humankind, finding drugs, curing AIDS and cancer, etc. There seems to be a huge energy, and lots of practical technical reasons why this area is crying out to be one of the flagship areas that the Semantic Web really takes off …”

Data → Information → Knowledge
Navarro JD, Niranjan V, Peri S, Jonnalagadda CK, Pandey A. (2003) From biological databases to platforms for biomedical discovery. Trends Biotechnol. (6):263-8.

Problem with the Current WWW

Problem with the Current Web
Kei Tsi Daniel Cheng (this is not me!!)
Kei Cheung (15 years ago)
Kei Cheung (9 months ago)
Keyword search: “regulatory variation” & “mammals”

Data Heterogeneity
• Lack of standard, detailed descriptions of resources
• Data are exposed in different ways
  – Programmatic interfaces
  – Web forms or pages
  – FTP directory structures
• Data are presented in different ways
  – Structured text (e.g., tab-delimited format and XML format)
  – Free text
  – Binary (e.g., images)

Data Heterogeneity (cont’d)
• Nomenclature problem
  – Gene/protein names (based on phenotype, sequence, function, organism, etc.)
    • Armadillo (fruit flies) vs. β-catenin (mice)
    • PMS1 (human) = PMS2 (yeast); PMS1 (yeast) = PMS2 (human)
    • Sonic Hedgehog
  – ID proliferation
    • Different ID schemes: 1OF1 (PDB ID) and P06478 (SwissProt ID) both correspond to Herpes Thymidine Kinase
    • Lexical variation: GO1234, GO:1234, GO-1234
  – Synonyms vs. homonyms
    • Dopamine receptor D2: DRD2, DRD-2, D2
    • PSA: prostate specific antigen, puromycin-sensitive aminopeptidase, psoriatic arthritis, pig serum albumin
• “Biologists would rather share their toothbrush than a gene name … Gene nomenclature is beyond redemption”, said Michael Ashburner

From Web to Semantic Web (cont’d)
• Human processing → Machine processing
• Use of metadata
  – Free text description → ontological description
  – HTML → XML → RDF or its extensions
• Vision → implementation

HTML Example
Readme

Pedigree data (one row per person):
1  1  0  0  1  1
1  2  0  0  2  0
1  3  1  2  2  0
1  4  1  2  1  0
1  5  1  2  1  1
1  6  1  2  1  0

Readme contents:
Col#  Description
1     Pedigree id
2     Person id
3     Father id
4     Mother id
5     Sex
6     Status

<html>
<body>
…
<a href="http://ycmi.med.yale.edu/ped_readme.html">Readme</a>
<table>
<tr> <td>1</td> <td>1</td> <td>0</td> <td>0</td> … </tr>
…
</table>
…
</body>
</html>

What is XML?
• eXtensible Markup Language
• It is self-describing
• It is hierarchical
• It is human- and computer-readable
• It is a World Wide Web Consortium (W3C) standard
• It can be validated using a DTD or XML Schema
• There is a large base of supporting software

XML Example

Proliferation of Bio-XML Formats
[Figure labels]
• Sequence: BSML, AGAVE
• Microarray gene expression: GEML, MAML, MAGE-ML
• Pathway: SBML, PSI-MI
• BIND
• XML representation of proteomics data: AGML, HUP-ML
• RDF representation: RDF (e.g., BioPAX), semantically rich ontologies, reasoning (machine intelligence)

Resource Description Framework (RDF)
• It is a standard data model (directed labeled graph) for representing information (metadata) about resources in the World Wide Web
• In general, it can be used to represent information about “things” that can be identified (using URIs) on the Web
• It is intended to provide a simple way to make statements (descriptions) about Web resources

RDF Statement
An RDF statement consists of:
• Subject: a resource identified by a URI
• Predicate: a property (defined in a namespace identified by a URI)
• Object: a property value (literal) or a resource
For example, the dbSNP Website is a subject, creator is a predicate, NCBI is an object. A resource can be described by multiple statements.

Graphical & XML Representation
[Graph]
http://www.ncbi.nlm.nih.gov/SNP → http://purl.org/dc/elements/1.1/creator → http://www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/SNP → http://purl.org/dc/elements/1.1/language → en

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/SNP">
    <dc:creator rdf:resource="http://www.ncbi.nlm.nih.gov"></dc:creator>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>

Life Sciences Identifiers (LSIDs)
• URL vs. URI vs. URN
  – URL http://www.gleaners.org/faq.html
  – URI http://www.gleaners.org/faq.html#Q04
  – URN www.gleaners.org/faq.html#Q04
• LSID is a form of URN

Problems of URIs
• The web server referenced by the URL may be broken or become unavailable
• The syntax of the URL may change over time as the underlying data retrieval program evolves
• The data returned by a URL may change over time as the underlying database contents change

LSID Format and Examples
• URN:LSID:namespace:database:object_id[:revision_id]
• Examples:
  – URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072
  – URN:LSID:chemacx.cambridgesoft.com:ACX:CAS967582:1

LSID (cont’d)
• Globalness: An LSID is a name with global scope that does not imply a location. It has the same meaning everywhere.
• Uniqueness: The same LSID will never be assigned to two different objects.
• Persistence: It is intended that the lifetime of an LSID be permanent.
• Scalability: LSIDs can be assigned to any data element that might conceivably be available on the network, for hundreds of years.
• Legacy Support: The LSID naming scheme must permit the support of existing legacy naming systems.
• Extensibility: Any scheme for LSIDs must permit future extensions to the scheme.
• Independence: It is solely the responsibility of a name issuing authority to determine conditions under which it will issue a name.
• Resolution: A URN will not impede resolution, i.e., translation to a URL...

Semantic Web Applications
• Connotea (online management of web resources)
• Piggy Bank (semantic web browser)
• YeastHub: yeast genome data integration

Connotea: Online Reference Management Service (Nature Publishing Group)
www.connotea.org
• To keep links to the articles/websites of interest to you
• To discover new articles and websites by sharing your links with other users
• It is web-accessible

TBL’s original vision of the Web
• Active vs. passive
• Collaborative vs. authoritative
• Decentralized vs. centralized
• Semantic vs. syntactic

Connotea: Online Reference Management Service (Nature Publishing Group)
[Screenshots: ALFRED Population Sample; Connotea (ALFRED Example); ALFRED Example; Google Earth Example]

Data Integration Using RDF
[Figure: three source graphs merged into a unified view]
• GenBank: atagccgta cctgcgagt ctagaagct – derives from – human hemoglobin
• Gene Ontology: human hemoglobin – is a – oxygen transport protein
• Protein Data Bank: human hemoglobin – has 3D structure
• Unified view: the sequence, human hemoglobin, oxygen transport protein, and the 3D structure linked in a single graph

Piggy Bank
• http://simile.mit.edu/piggy-bank
• It is an extension to the Firefox Web browser
• It turns the Firefox Web browser into a Semantic Web browser
• It supports tagging and links to Google Maps

RDF is the Common Currency

Piggy Bank (Data Integration Example)
[Figure labels: TRIPLES (Expr. Data) → D2RQ → RDF Expr. Dataset → import; HubMed keyword search → RDF Bib. Info. → import; Plugin: browse/query]

TRIPLES Expression Data in RDF

Piggy Bank (PIM1 Gene)

Semantic Bank

YeastHub

YeastHub Team
• Kei Cheung
• Kevin Yip
• Remko de Knikker
• Mark Gerstein
• Andrew Smith
• Andy Masiar

RDF Technologies
• Description of data sources using Rich Site Summary (RSS)
• Data conversion into RDF
  – Relational database to RDF (D2RQ)
  – Tabular-RDF conversion
• RDF database (Sesame)
  – RDF-based query languages

Rich Site Summary (RSS)
[Figure labels: User (Application); aggregator; YeastHub; Resources, some with RSS and some with no RSS]

Resource Description (Use of Dublin Core Metadata)

RDF Metadata Example (RSS 1.0)

Data Conversion and Integration
[Figure labels: Resource1, Resource2, Resource3, … Resourcen; <xml> … </xml>; DOM/SAX, D2RQ, XSLT; RDF1, RDF2, RDF3; RDF/DB <rdf> … </rdf> (Sesame); RDQL; Users/Agents]

RDF Modeling of Tabular Data
[Figure labels: Genome Object; Organism; Object type; Collection of specific types of genome objects (e.g., genes, proteins)]

Tabular-RDF Data Conversion

Example of Data Converted into RDF

Motivating Example
• Genomic analysis of essentiality within protein networks.
  – H Yu, D Greenbaum, H Xin Lu, X Zhu, M Gerstein (2004) Trends Genet 20: 227-31.
  – Jeong, H., Mason, S., Barabási, A.-L., and Oltvai, Z. 2001. Lethality and centrality in protein networks. Nature 411: 41–42
  – Fraser, H., Hirsh, A., Steinmetz, L., Scharfe, C., and Feldman, M. 2002. Evolutionary rate in the protein interaction network. Science 296: 750–752
• Important but hard…

Example Integrated Query
• Essential genes: YGDP, MIPS
• Protein-protein interactions (connectivity): BIND
• Annotation & expression data: SGD, GO, TRIPLES

Query Form

RQL Syntax and Query Results

Next step… Data Mining
• Whole yeast genome analysis (Y6K)
  – Subcellular localization of the yeast proteome.
    • A Kumar, S Agarwal, JA Heyman, S Matson, M Heidtman, S Piccirillo, L Umansky, A Drawid, R Jansen, Y Liu, KH Cheung, P Miller, M Gerstein, GS Roeder, M Snyder (2002) Genes Dev 16: 707-19.
  – A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome.
    • A Drawid, M Gerstein (2000) J Mol Biol 301: 1059-75.
• Doing systematic data mining to predict the remaining 3K localizations
• Important but hard…

“Once the web has been sufficiently "populated" with rich metadata, what can we expect? First, searching on the web will become easier as search engines have more information available, and thus searching can be more focused. Doors will also be opened for automated software agents to roam the web, looking for information for us or transacting business on our behalf. The web of today, the vast unstructured mass of information, may in the future be transformed into something more manageable - and thus something far more useful.” (Ora Lassila)

Automate humanely!
“No amount of automation will replace human beings, but clumsy and belligerent automation will alienate them and suppress their creativity.” (Tony Kazic)

Thanks! Questions?

Semantic Graph
Find the most current image of Kei Cheung who is affiliated with YCMI
[Graph]
• Kei Cheung – affiliated with – YCMI
• Kei Cheung – images – Files
• File 1 – member of – Files; File 1 – date – Oct 1, 1990
• File n – member of – Files; File n – date – Feb 1, 2005

Research/Technologies Related to Semantic Web
• Text mining
• Agent computing
• Web services
• Ontological research

Knowledge representation
[Graph: Joe – has_parent – Jill; Jill – has_child – Sue; Sue – has_child – ?]
• A person (Joe) is an uncle iff
  – Joe is male
  – He has a parent (Jill) who has a second child (Sue) who is a parent

Other things to mention?
• Taxonomy vs. ontology
• OWL overview and example(s)
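
Backup sketch: the uncle rule as a query over triples
The "Knowledge representation" slide states the uncle rule in words. Below is a minimal Python sketch of how such a rule might be checked against a small set of (subject, predicate, object) statements. It is an illustration only, not how an RDF store or OWL reasoner works internally. The helper names (objects, is_uncle), the is_a/Male encoding of "Joe is male", and the child name "Ann" (standing in for the "?" node on the slide) are assumptions made for this example.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# "Ann" is a hypothetical name for the unnamed "?" node on the slide.
triples = {
    ("Joe", "is_a", "Male"),
    ("Joe", "has_parent", "Jill"),
    ("Jill", "has_child", "Joe"),
    ("Jill", "has_child", "Sue"),
    ("Sue", "has_child", "Ann"),
}


def objects(graph, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def is_uncle(graph, person):
    """Apply the slide's rule: the person is male and has a parent
    who has a second child who is in turn a parent."""
    if "Male" not in objects(graph, person, "is_a"):
        return False
    for parent in objects(graph, person, "has_parent"):
        for sibling in objects(graph, parent, "has_child"):
            # The sibling must be a different person and must have a child.
            if sibling != person and objects(graph, sibling, "has_child"):
                return True
    return False


if __name__ == "__main__":
    for name in ("Joe", "Jill", "Sue"):
        print(name, is_uncle(triples, name))
```

The point of the sketch is that "uncle" is never stored as a statement: it is derived from has_parent and has_child statements, which is the kind of inference an ontology language such as OWL is meant to make declarative.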