Intelligent Information Systems
7. Bioinformatics
Based on the NRC* Bioinformatics Workshop on Data Integration, Washington DC, February 2000; to be published as an NRC/NAS report.
Gio Wiederhold et al.
EPFL, April-June 2000, 14:15 - 15:15, room INJ 218
*NRC = National Research Council, the analysis and publication arm of the U.S. National Academy of Sciences
7/26/2016 EPFL7B - Gio spring 2000 1

Schedule
Presentations in English -- but I'll try to manage discussions in French and/or German.
• Material covered in an integrating fashion, drawing from concepts in databases, artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D. Beringer]
7. 31/5 Application to Bioinformatics.
----------- break -----------
8. 15/6 Resolving semantic heterogeneity. Educational challenges.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
Bio-Information
• to learn about ourselves,
  – our origins, our place in the world
    • Humans, other primates, mice, zebrafish, fruit flies* (Drosophila), roundworms* (C. elegans), viruses such as HIV*, yeast*, plants such as corn*
  – modesty, seeing how much we share with all organisms
  – not just of philosophical interest, but also
• to help humanity to lead healthy lives
  – to create new scientific methods
  – to create new diagnostics
  – to create new therapeutics
* substantially/completely sequenced; also a bacterium* (Haemophilus influenzae)

Bioinformatics
Information systems applied to biology and healthcare
• Biomedical statistics
• Image data
• …, …
Genomics - information related to gene-derived data
• a subset of major interest
• boundary often unclear: nature versus nurture & lifestyle
  – A person's genomic make-up has a major effect on susceptibility to diseases, positive and negative (example: smoking, exposure to smoke & lung cancer)
  – Major genomic errors prevent birth, hence we deal with differences that are relatively minor (289 / ~10 000 genes suspected/identified)
  – complexity: most health effects are also combinatorial: multiple genes, promoters, inhibitors, metabolic cross-roads

Integrating knowledge
Meeting in February 2000 brought together biologists and computer scientists from academia, industry and government to discuss salient issues in biological computing. The following topics were covered:
• the generation and integration of biological databases;
• interoperability of heterogeneous databases;
• integrity of databases;
• modeling and simulation;
• data mining; and
• visualization of "model fit" to data.
No single person can cover -- understand -- it all anymore.
Report of Feb.
2000 meeting to be published by NAS Press

Players
• Human Genome Project (NIH-NCGI & Wellcome Trust): $250M, 1988--2005, but likely roughly completed in 2000/2001?
  – work at universities, related research labs, split per 24 chromosomes
  – collected in public databases www.ncbi.nlm.nih.gov/genome/seq
    • 100 M in 1998 (well annotated, with paper publishing)
    • 2 100 M by March 2000. ~12,000 base pairs per day in 1999.
• Technology and strategies caused exponential rates of improvement
  – PCR (polymerase chain reaction) allows multiplication of strands based on partial initial tags
  – automation [Perkin-Elmer Biosystems, Affymetrix…]
  – piece-wise (100-1 000) analysis and subsequent assembly versus walking the gene
  – pieces overlap, software to match
• Private enterprises at various levels
  – not-for-profit [The Institute for Genomic Research (TIGR), dir. Craig Venter]
  – for profit [Celera Genomics (Venter), Incyte]: sell leads to pharmaceutical companies
  – early discovery pharmaceuticals [HGS Inc, Millennium Ph.]
  – established pharmaceutical companies, in-house [all now]: support drug development, trials on animals and humans {toxicity, then benefit}, marketing.

Quantities
[Diagram; label: Progress]
• 1 human: the human genome, ~3 200 000 000 base pairs
• < 10 000 proteins, ? diseases: genes, and gene abnormalities
• 6 000 000 000 humans: everybody's genes
• < 1000 systems: metabolic pathways
• ~2 000 000 molecules: small organic molecules - affect proteins - suitable for drugs

Relationships
• Base pairs: certain pairs of the 4 nucleotide bases: A, C, G, T
  • adenine, cytosine, guanine, thymine, combine in a double helix
  • 3 base pairs (a codon) define one of 20 amino acids: 20 < (4^3 = 64)
• Proteins:
  – determined by certain sequences O(100) of amino acids: genes
  – assembled by the ribosome according to an RNA template from DNA
  – coded in ~3% of the DNA sequence -- but where?
  – 97% is miscellaneous: promoters / inhibitors / historical junk
  – multiple genes for many proteins provide redundancy?

Decoding the DNA sequence
• Multiplying the source material
  – PCR -- matches start, copies
• Walking the sequence
  – using enzymes to cut, match by next piece
  – extract characteristics from segments by chromatography
• Shotgun approach
  – cut into many segments (EST), will overlap arbitrarily
  – analyze automatically (automated chromatography)
  – assemble permutations and determine by best match
• Determine the likely function by best match with known genes
  – BLAST-based tools, similar structure implies similar function

Matching of sequences
• Difficult because of
  – errors in source amino-acid sequence
  – missing subsequences, extra strands
  – meaningful variation:
    • HIV reverse transcriptase (RT) & protease are characterized by many mutations [http://hivdb.Stanford.edu]
  – variation in regions that do not code for proteins
  – loops and repeats in sequences
• Several tools: BLAST family, GRAIL
  – search for similarities
  – can create errors

Disease specific
Same process with organ cells from diseased person(s)
• Problem: cells carry complete DNA
  – look for family traits
    • Iceland study
  – compare to presumed healthy person's DNA
    • many differences are irrelevant
• Protein concentrations in cells differ
  – identify tissue samples to localize likely actions
  – test metabolic susceptibility in those cells to narrow functions
Requires mix of computer and biological competence

Diagnostics versus Drugs
Diagnostic:
• Analysis of individuals' and family members' make-ups
• Differences between diseased and normal persons
• simple case: 1 gene abnormality <--> 1 disease
• common: several genes, related to same protein -->
• complex: mixed metabolic paths, multiple proteins -->
• affected by nutrition (diabetes), lifestyle, random
events
• precursor for understanding, intervention --- --- ---> Drug development?

Disease targeting
• Early stage of drug development
  Now have an excess of targets (reverse of the situation in pharmaceutical companies 5 years ago)
  Find chemicals that affect those targets -- block or enhance
  – from corporate libraries -- with known attributes
  – from methodically created variations - chemical chips
    • high-throughput screening - 10 000/day
  – if significant, identify, produce samples for further tests
• Still takes a long time to have marketable drugs
  • toxicity tests in cells, in animals, humans: failure rate 80%?
  • effectiveness tests in animals, humans: failure rate 90%?
Many pathways affect many diseases. Balance and mix?

2D to 3D conversion
Understanding actual interaction effects requires 3-D models
Protein folding
• Strand of DNA, template for RNA, becomes protein, assumes a tight, 3-D shape
• The shape determines the attachment points to cells,
  – nature does folding in a few nanoseconds
  – computation based on finding minimum energy conformations would take many years
  – current research tries to break up the computation by recognizing common substructure types: alpha-helices, beta sheets, …
Hardest genomics research issue today

Use of 3-D configuration
Match protein or derivative to cell surface
• Attachment points
  – nooks and crannies: zinc fingers, sockets
  – permeability of membranes / cell walls for certain proteins
• Tools
  – Visualization
  – Docking programs
  – Computation of fit (minimal energy?)
• Research limited by lack of knowledge
  – 3-D configuration of proteins and cells
  – effect of in-vivo deformations

Identification
• Match patterns of two samples
  – label amino acids with fluorescent markers
  – does not require functional genomic knowledge
  – PCR multiplies sample size
  – Fluorescence-activated cell sorters can separate cells. Ex.: separate embryo cells from mother's blood by labeling with father's genes and matching
  – Familial ties, human migrations, ...: the child that died in a French prison was identified as Louis XVII by tissue comparison with current relatives
  – Ancestry of species by creating hierarchical difference trees; uses "junk portions" of the genome - functions no longer needed

Multiple Representations
[Diagram of data sources and their representations. Recoverable labels: Genbank (DNA strings, 5 billion bytes); ProteinDB; Chem (chemical structures, 2D - 3D); Literature (50 billion bytes of text covering Genbank); bibliographic citations; descriptions & statistics of disease/normal cases; essential detail; family traces; text & structured; holders: public, hospitals, corporate, and several proprietary ("propri.") sources.]

Heterogeneity inhibits Integration
• An essential feature of science
  – autonomy of fields
  – differing granularity and scope of focus
  – growth of fields requires new terms
• A feature of technological progress
  – standards require stability -- not seen now in genomics
  – yesterday's innovations are today's infrastructure
• Must be dealt with explicitly
  – sharing, integration, and aggregation are essential
  – large quantities of data require precision
• Precision is critical
  – whenever we deal with 100 000's of instances, even a 1% false positive rate means following up on 1000 false leads. When those leads are people we must be extra careful.
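The false-lead arithmetic on the precision bullet can be checked with a minimal sketch. It assumes, for simplicity, that essentially all 100 000 instances are true negatives; the function names and the figure of 50 true hits are hypothetical illustrations, not numbers from the workshop.

```python
def expected_false_leads(instances: int, false_positive_rate: float) -> int:
    """Expected number of false positives, assuming (as a simplification)
    that essentially all screened instances are true negatives."""
    return round(instances * false_positive_rate)


def precision(true_positives: float, false_positives: float) -> float:
    """Fraction of reported leads that are real."""
    return true_positives / (true_positives + false_positives)


# Figures from the slide: 100 000 instances, 1% false positive rate.
false_leads = expected_false_leads(100_000, 0.01)
print(false_leads)

# Hypothetical: if only 50 real cases exist and all are found,
# fewer than 5% of the leads followed up are genuine.
print(precision(50, false_leads))
```

The point of the second call is that with rare true cases, even a small false positive rate lets false leads dominate the follow-up work.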
Clinical: Diagnosis
Diagnosis is more advanced than treatment
• Match patient tissue sample pattern to rich pattern
  – VLSI technology used to place 10 000 known genes on a chip surface
  – look for matches of expressed genes vs expectations in cells from diseased tissue (skin for melanoma, …)
  – can distinguish, say, cancers that require specific treatment but are indistinguishable by pathologists
• Follow with
  – traditional treatments, if any -- but earlier / more aggressive / more specific
  – being careful -- haemophilia
  – being emotionally more prepared

Clinical Treatment
Only few choices now; they take many years to develop and test
Two ways to get good genes to work
• in vivo -- problem: rejection
  • put virus (can penetrate cells) with repaired gene into cells
  • those cells now generate proper protein
  • expect cells to replicate, and create more protein
• in vitro -- problem: getting protein to right places
  • use bacteria to replicate gene
  • let them manufacture needed proteins
  • inject proteins

Clinical Treatments 2
Or, block bad genes, all in vivo -- problem: knowledge, getting there
• flood area with decoy promoters
  – fool the ribosome, prevent transcription from DNA to RNA
• block RNA from being a model for more DNA
  – use anti-sense molecules to create wrong double helix segments
• stifle cells by synthetic antibodies (for cancers)
  – block growth factor attachment for its proteins, by providing fakes

Integration projects and topics
Meeting was intended to bring people together
• presentations present individual projects and concerns; can't do anything else if you want to be real
Notes that follow are a personal record, not a formal transcript.
Follow-up?
• Individuals
• Funding evaluation
• Specific interoperation projects?
Public versus proprietary interests
• Government funding is much less than industrial funding
• Where is leverage for interoperation?

Database Annotation: adds meaning
[Chris Overton, U Penn]
Genome annotation, original/subsequent - know provenance
• Provides links to data sources and to encoded proteins.
• Predicts and archives landmarks.
Genbank: majority of entries have annotation ambiguity. Poor advice of changes other than to sequence.
PDB does not list all binding sites found in proteins. Lack of motivation/confidence of authors? [Weissig, Bioinformatics 99]
Errors come from
• experimental data,
• manual curation from the literature,
• computational predictions (Grail uses neural nets),
• and are propagated increasingly in computation and integration.
K2 mediator project (GAIA DB) [Chris Overton, Univ. of Pennsylvania]
Uses Genbank, SwissProt, TRRD, GERD, TRANSFAC, MEDLINE (some have moderately or highly restrictive licenses).
Looks for syntactic errors, using a formal grammar for eukaryotic genes, matching introns and exons (implied in GD), also actual coding regions, and propagated spelling problems.

Database Correctness
[Bill Anderson, Knowledge Bus Inc, Hanover MD]
Develop methods for correcting errors
'Debabelization' for EML (European Media Lab.) and EMBL (Heidelberg): all their work (Data Alive) is based on an ontology for biochemical databases
• Syntactic errors: formats
• Semantic errors: interpretation of relations -- ontology
• Pragmatic errors: true data differences (experiment, transcription)
biochemical ontology --> microanatomy --> {spatial, events}, chemistry --> {spatial, events} --> (several 100 axioms as constraint rules)
(Either the database or the constraining ontology is wrong.)
When a fault occurs go back to pragmatics, no automatic curation.

Database Curation
[Michael Cherry, SGD database,
genome-www.stanford.edu]
[Michael Ashburner for fly, mouse, and yeast (Saccharomyces)]
Long-term quality control: curation is the act of establishing and maintaining a database, here often the chromosome-specific or species-specific databases.
Similar task to what a journal editor does. Curator also functions as an Educator, Ontologist [Yahoo term].
Learn what aids the community needs, and build the museum to satisfy those needs [John Cotton Dana, 1850].
Set limits according to what you can do and obtain.
Find missing details in literature, include summary paragraphs.
Requires a gene ontology for molecular function and information on cellular location (absolute or relative); used for annotating results from microarrays.

Organizing Genomic Data
[Jim Garrels, Proteome, Inc. www.proteome.com]
Literature: 50 billion bytes of text covering the 5 billion bytes in Genbank.
BioKnowledge Library curated by experts -- Proteome DB is free
• title with brief functional description,
• family,
• properties (mutant phenotype, ...),
• sequence annotations,
• related proteins: Orthologs and Interlogs (in different species),
• classification,
• integrated from cDNA microarrays and chips, systematic 2-hybrids, …
Model organisms: started with yeast, now worms [Stuart Kim, Stanford]. Several 1000 physical associations and interactions.
Authors should not publish experimental data directly into a DB and curate their own papers, but submit their results and expression studies?
How to deal with updates of their own results? Resubmit?
Need mediating portal sites as well as content sites.

Relate Genes to what is happening
[Dong-Guk Shin, Univ.
Connecticut shin@engr.uconn.edu]
Virtual Cell Project: cell physiology modeling, NIH supported; also available without DB support, from www.nrcam.uchc.edu
Identify gene function (i.e., protein generation) in cells
Bottom-up approach to cell modeling
Cross-checking of models and hypotheses
Geometry obtained from segmented images
2-D visualization of specified reactions: channels, pumps, for extra-, intra- (cytosol), and in-core cellular compartments.
Generates equations for simulation.
Result is a DB publication cycle, supporting model copying & adaptation.
Access for remote users needs more than a browser: also a query system, with joins over associations.
DBs need APIs and mediation for scalability and mismatch.

4.5 Aspects for Interoperability of Databases
[Daniel Gardner, Cornell University. cortex.med.cornell.edu]
1. User - platforms, software, open to new data: model journal to define scope and views, but include data - re-analyzable. Data quality is domain-dependent. Data sets presented via a virtual oscilloscope.
2. Common data model (XML based, with capability for interdomain queries) for neuroscience.
  – Hierarchical with a controlled vocabulary, for selected granularity.
  – Much metadata (physiological site, data, reference, method and model elements), used in query terms as well.
  – Data compatibility - federated, and evolving.
3. Temporal - legacy, current, future (IBM card -- XML)
4. Technical - proprietary versus open (as PNAS papers)
4.5 Domain specific versus interdisciplinary: just interfaces. XML BDML for brains. Will be longer lived than CORBA.
<<The problem of interoperation is not the syntax of XML, but the semantics of the DTD tags. Scalability beyond neurosciences. Federation versus articulation.>>

Databases are supplanting journals, but …
[Peter Karp: www.ai.sri.com/pkarp/mmdb/94/]
Progress in interoperation
Databases are re-analyzable, important for validation, extension.
Results published in journals are not.
Estimate now about 500 public databases for bioinformatics. Want seamless interoperation.
Problems:
• Differing models, some are just irregular flat files
• Various units of measurement, leading to semantic errors.
• Much text (SRS) vs. structured information
• Not all have APIs, nor web APIs
• DBMSs lack ontologies, no formal model, inconsistent semantics (examples even in Genbank entries), often don't have the right fields (SwissProt: inferred versus being observed).
• Maintenance poor over time.
Warehouse versus multi-databases?

E. coli Metabolic Databases: EcoCyc
[Peter Karp, SRI Int., Bioinformatics Res. Group, pkarp@ai.sri.com]
The EcoCyc database contains 150 metabolic pathways known in E. coli, with cross-references to Genbank, literature, evaluation.
Not only relevant for E. coli: these pathways are also found in other species, locatable by gene matching.
Also HinCyc for H. influenzae (proprietary?)
To provide cross-linking; used by other (mediation) projects: K2 at U Penn, OPM by Gene Logic, Hyperlinking at SB-Glaxo.
Proposed XOL = ontology exchange language.

Flybase
[William Gelbart, Harvard University]
Flybase collects more than just the fruit fly gene sequence, namely exons and their mutations, and transposon insertion sites.
Moving from being hunter-gatherers in science to harvesters, moving to an agronomical society << requires new laws >>
Phenome <--> complexome --> Genome <-- transcriptome -->> Proteome.
Classical genomics is being superseded by expression and interaction of gene products and gene perturbation <--> phenotypes.
How do we organize DBs for that objective? Many sorting methods.
Things {biological objects, relationships among the objects -- with sources} --> robust object classifiers with controlled vocabularies.
Foundation DBs vs derived DBs -- define ownership of source DBs.
• Histories must be maintained.
• Version tracking.
• Presentation standards.
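One way to read the last three bullets (histories, version tracking, sources) is as an append-only revision list per biological object. The sketch below is only an illustration under that assumption; the class names, fields, and the example identifier are hypothetical and do not describe how Flybase is actually implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One immutable state of an annotation, with its provenance."""
    version: int
    value: str
    source: str      # where the statement came from (foundation DB, curator, ...)
    note: str = ""


@dataclass
class TrackedAnnotation:
    """Append-only history: old revisions are never overwritten."""
    object_id: str
    history: List[Revision] = field(default_factory=list)

    def update(self, value: str, source: str, note: str = "") -> Revision:
        rev = Revision(len(self.history) + 1, value, source, note)
        self.history.append(rev)
        return rev

    def current(self) -> Optional[Revision]:
        return self.history[-1] if self.history else None

    def at_version(self, version: int) -> Revision:
        if not 1 <= version <= len(self.history):
            raise KeyError(f"no version {version} for {self.object_id}")
        return self.history[version - 1]


# Hypothetical usage: a derived DB corrects an entry but keeps the original.
annotation = TrackedAnnotation("example-gene-0001")
annotation.update("exon boundaries as submitted", source="foundation DB")
annotation.update("exon boundaries corrected", source="derived DB curator",
                  note="fix after re-analysis")
print(annotation.current().version, annotation.at_version(1).source)
```

Keeping every revision with its source lets a derived database add corrections without obscuring what the foundation database originally stated.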
Electronic Publication
[Brian Ray, American Assoc. for the Advancement of Science]
Most journals require submission of gene sequences into <>
Papers only summarize, and describe process
But what will journals look like now?
The Signal Transduction Knowledge Environment: www.stke.org
STKE: virtual journal, developed jointly with HighWire Press: using the web for summarizing relevant articles from other (electronic) journals
A prototype for a future publication model: all academic papers are placed into a pile, classified into one or more discipline categories, and aggregated and retrieved by secondary specialists - a new role for editors, requiring scientific competence and authority.
Maintains a pathway map for attaching
Has a controlled vocabulary.
Does caching of retrieved referenced Medline articles.

Larger-scale units of Biological Data
[Stephen Koslow, Office on Neuroinformatics, NIMH]
The human brain has
• 100 billion (10^11) neural cells, dozens of cell types.
• 10^15 connections.
• uses 15 Watts.
Voluminous 3-D MRI data, at higher granularity. Basis for localization of diagnostic EEG, MEG observations.
Neuroscience is a growing field, includes neuroinformatics.
Has initial, broad journals, reductionist journals,
Numerical, symbolic, literature and image data.
Volume of publication only for serotonin, discovered in 1948, now 70 000 papers, is becoming impossible to follow.
See UCLA brain mapping project for basis data -- normal brain.
[www.nimh.nih.gov/neuroinformatics/index.cfg]

Modeling & Simulation
[James Bower, California Institute of Technology]
Modeling and simulation of Purkinje cells
Purkinje cell (6 M in human), 100 micrometers, has 250 000 inputs, 10-12 distinct conductances; modeled by Erik De Schutter [now Belgium]. Tested with electrical probes.
• Found differences with publ. information: here the dendrite is a current sink.
  – Rethinking of cerebellum.
  – It is a sensory device, not a motor control device.
  – Shown by experiments on motor and sensing, and observing brain activity.
• Still, linking images and actual activity of neurons in that area is hard.
• Levels: cognitive - system - network - cellular - subcellular - molecular - atomic
Web site, Purkinje Park, allows ongoing collaboration with students; www.whyville.net/index.html - kids learning relationships among levels
Corresponding simulators: Computer Science ACT, SOAR (connects 2 levels), GENESIS (4 levels), neural nets, NEURON (2) - MCELL/VCELL (2) are very simplistic, RASMO/WebLab, GEPASI/GAMESS/Psl.

Analytical Approaches
[Douglas Brutlag, Stanford University]
Applications of data mining
Many types of relevant DBs for sequence, sequence variation information
Now also
• relationship DBs (phylogenetic),
• gene fusion [Eisenberg],
• pathways,
• gene expression,
• protein-ligand,
• signal transduction
Challenge: finding them, syntax, semantics (MESH inadequate)
Doubletwist [Pangea] - agent-based, domain-specific journal summaries and notifications of subsequent published findings.

Conclusion: Data, and Models to represent understanding of data
Sharing and publishing electronically at two levels
1. Sources, i.e.: data -- with provenance - incl. predictions, fixes.
  • recognize owners' objectives - they may not be your objectives (PDB does not list all binding sites found - lack of motivation)
2. Models, incorporating knowledge, with means to populate the model
3. Added value by secondary processing - shared ownership (c)
• Expanding on Prof. Gelbart's example by moving from agronomic to the medieval guilds -- the predecessors of professional societies -- sitting around the market square, where the farmers deliver their source, as wholesalers and intermediaries.
• Well maintained derived databases also have value: added value by expertise focused on some objective.
Integration summary
• A focus of knowledge generation is integration of data
• The problem of interoperation is not the syntax of XML, but the semantics of the DTD tags.
  – Scalability beyond neurosciences.
  – Federation versus articulation.
  – XML debabelizer.
• Yes, keep the fundamental sources, but get added value in derived data (as SwissProt):
  – error correction for a specific objective (U Penn work),
  – adding entries
  – Does not require federation and terminological alignment of all sources.
• Rules and ontologies provide incremental help.
  – help much, but don't solve problems of semantic errors

The People Problem
The demand for people in bioinformatics is high at all levels
• Critical is a lack of
  – training opportunities - programs and teachers
  – available trainees
• Being in a multi-disciplinary field is scary
  – tenure for faculty
  – load for students
  – salary and growth differentials in biology and CS
• Some institutions are moving aggressively
  • [Caltech, U Penn, EPFL?]
  – must compete with World-Wide Web visions

Bioinformatics needs Ethics
Knowledge carries responsibilities. Also, there are always some error rates.
How will people feel about your knowledge about them? Their genetic make-up, physical & psychological propensities.
Privacy is hard to formalize, but that does not mean it is not real to people. Perceptions count.
(There is also real stuff: insurance scams, personal relations.)
Diagnostics without therapies.