7. Bioinformatics Intelligent Information Systems

Intelligent Information Systems
7. Bioinformatics
based on NRC* Bioinformatics workshop on Data Integration,
Washington DC, February 2000
to be published as an NRC/NAS report
Gio Wiederhold et al.
EPFL,
April-June 2000, at 14:15 - 15:15, room INJ 218
*NRC = National Research Council,
Analysis and publication arm of the
U.S. National Academy of Sciences
7/26/2016
EPFL7B - Gio spring 2000
1
Schedule
Presentations in English -- but I'll try to manage discussions in French and/or German.
• Material covered in an integrating fashion, drawing from concepts in databases,
artificial intelligence, software engineering, and business principles.
1. 13/4 Historical background, enabling technology: ARPA, Internet, DB, OO, AI, IR.
2. 27/4 Search engines and methods (recall, precision, overload, semantic problems).
3. 4/5 Digital libraries, information resources. Value of services, copyright.
4. 11/5 E-commerce. Client-servers. Portals. Payment mechanisms, dynamic pricing.
5. 19/5 Mediated systems. Functions, interfaces, and standards. Intelligence in
processing. Role of humans and automation, maintenance.
6. 26/5 Software composition. Distribution of functions. Parallelism. [ww D.Beringer]
7. 31/5 Application to Bioinformatics.
----------- break -----------
8. 15/6 Resolving semantic heterogeneity. Educational challenges.
9. 22/6 Privacy protection and security. Security mediation.
10. 29/6 Summary and projection for the future.
• Feedback and comments are appreciated.
Bio-Information
• to learn about ourselves,
– our origins, our place in the world
• Humans, other Primates, Mice, Zebrafish,
Fruit Flies* (Drosophila), Roundworms* (C. elegans),
viruses such as HIV*, Yeast*, plants such as corn*
– modesty, seeing how much we share with all organisms
– not just of philosophical interest, but also
• to help humanity to lead healthy lives
– to create new scientific methods
– to create new diagnostics
– to create new therapeutics
* substantially/completely sequenced.
also bacterium* (Haemophilus influenzae)
Bioinformatics
Information systems applied to biology and healthcare
• Biomedical statistics
Image data,
• …, …
Genomics - information related to gene-derived data

• a subset of major interest
• boundary often unclear: nature versus nurture & lifestyle
(e.g., smoking, exposure to smoke & lung cancer)
– A person’s genomic make-up has a major effect on
susceptibility to diseases: positive and negative
– Major genomic errors prevent birth, hence
– we deal with differences that are relatively minor
289 / ~10 000 genes suspected/identified
– complexity: most health effects are also combinatorial
multiple genes, promoters, inhibitors, metabolic cross-roads
Integrating knowledge
Meeting in February 2000 brought together biologists and computer
scientists from academia, industry and government to discuss salient
issues in biological computing.
The following topics were covered:
• the generation and integration of biologic databases;
• interoperability of heterogeneous databases;
• integrity of databases;
• modeling and simulation;
• data mining; and
• visualization of "model fit" to data
No single person can cover -- understand -- it all anymore
Report of Feb. 2000 meeting to be published by NAS press
Players
• Human Genome project (NIH-NCGI & Wellcome trust), $250M,
1988 -- 2005, but likely roughly completed in 2000/2001?
– work at universities, related research labs, split per 24 chromosomes
– collected in public databases www.ncbi.nlm.nih.gov/genome/seq
• 100 M in 1998 (well annotated, with paper publishing)
• 2 100 M by March 2000. ~12,000 base pairs per day in 1999.
• Technology and strategies caused exponential rates of improvement
– PCR (polymerase chain reaction) allows multiplication of strands based on partial initial tags
– automation [Perkin-Elmer Biosystems, Affymetrix…]
– piece-wise (100-1 000) analysis and subsequent assembly versus walking the gene
– pieces overlap, software to match
• Private enterprises at various levels
– not-for-profit [The Institute for Genomic Research (TIGR), dir. Craig Venter]
– for profit [Celera Genomics (Venter), Incyte] sell leads to pharmaceutical companies
– Early discovery pharmaceuticals [HGS Inc, Millennium Ph.]
– Established pharmaceutical companies in-house [all now], support
drug development, trials on animals and humans, {toxicity, then benefit} trials, marketing.
Quantities
Progress:
• 1 human -- the human genome: ~ 3 200 000 000 base pairs
• < 10 000 proteins? diseases -- genes, and gene abnormalities
• 6 000 000 000 humans -- everybody’s genes
• < 1000 systems -- metabolic pathways
• ~2 000 000 molecules -- small organic molecules - affect proteins - suitable for drugs
Relationships
• Basepairs: certain pairs of 4 nucleotide bases: ACGT
• adenine, cytosine, guanine, thymine (A pairs with T, C with G),
combine in double helix
• 3 basepairs (a codon) define one of 20 amino acids < (4^3 = 64)
• Proteins:
– determined by certain sequences O(100) of amino acids: genes
– assembled by ribosome according to RNA template from DNA
– coded in ~3% of the DNA sequence -- but where?
– 97% is miscellaneous: promoters / inhibitors / historical junk
– multiple genes for many proteins provide redundancy?
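The counting argument on this slide (4 bases, read three at a time, give 4^3 = 64 codons for 20 amino acids) can be checked with a small sketch. It assumes the standard genetic code in its usual compact TCAG ordering; the function name `translate` and the sample sequence are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA coding strand codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")                         # 4^3 codons
    print(len(set(AMINO_ACIDS) - {"*"}), "amino acids")       # redundancy: 64 codons, 20 amino acids
    print(translate("ATGGCCTAA"))                             # hypothetical toy sequence
```

The table shows the redundancy directly: several codons map to the same amino acid, and three codons signal a stop.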
Decoding the DNA sequence
• Multiplying the source material
– PCR (polymerase chain reaction) -- matches start, copies
• Walking the sequence
– using enzymes to cut, match by next piece
– extract characteristics from segments by chromatography
• Shotgun approach
– cut into many segments (EST), will overlap arbitrarily
– analyze automatically (automated chromatography)
– assemble permutations and determine by best match
• Determine the likely function by best match with known genes
– BLAST-based tools, similar structure implies similar function
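The shotgun step (cut into overlapping segments, then assemble by best match) can be sketched as a greedy merge of the two fragments with the longest overlap. This is a toy illustration, not the method used by any production assembler: it assumes error-free reads from a single strand, and the names `overlap`, `greedy_assemble` and the sample fragments are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)] + [merged]
    return frags


if __name__ == "__main__":
    pieces = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
    print(greedy_assemble(pieces))
```

Repeats in the genome and sequencing errors are what make the real problem hard; the greedy choice here can pick the wrong merge when the same overlap occurs in more than one place.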
Matching of sequences
• Difficult because of
– errors in source amino-acid sequence
– missing subsequences, extra strands
– meaningful variation:
• HIV reverse transcriptase (RT) & protease are characterized by
many mutations [http://hivdb.Stanford.edu]
– Variation in regions that do not code for proteins
– Loops and repeats in sequences
• Several tools: BLAST family, GRAIL
– search for similarities
– can create errors
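As a rough illustration of similarity search that tolerates errors, missing pieces, and extra material, here is a minimal local-alignment score in the Smith-Waterman style. BLAST itself is a faster heuristic and is not reproduced here; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for the sketch.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman recurrence)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGTACGT", "ACGACGT"))  # one missing base is bridged by a gap
```

A high score only says that two sequences are similar; whether that similarity implies a shared function is the inference that can create errors.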
Disease specific
Same process with organ cells from diseased person(s)
• Problem: cells carry complete DNA
– look for family traits
• Iceland study
– compare to presumed healthy person’s DNA
• many differences are irrelevant
• Protein concentrations in cells differ
– identify tissue samples to localize likely actions
– test metabolic susceptibility in those cells to narrow functions
Requires a mix of computer and biological competence
Diagnostics versus Drugs
Diagnostic:
• Analysis of individuals’ and family members’ makeups
• Differences between diseased and normal persons
• simple case: 1 gene abnormality <-- -> 1 disease
• common: several genes, related to same protein -->
• complex: mixed metabolic paths, multiple proteins -->
• affected by nutrition (diabetes), lifestyle, random events
• precursor for understanding, intervention --- --- --->
Drug development?
Disease targeting
• Early stage of drug development
Now have an excess of targets
(reverse of situation in pharmaceutical companies 5 years ago)
Find chemicals that affect those targets -- block or enhance
– from corporate libraries -- with known attributes
– from methodically created variations - chemical chips
• high-throughput screening - 10 000/day
– if significant, identify, produce samples for further tests
• Still takes a long time to have marketable drugs
• toxicity tests in cells, in animals, humans -- failure rate 80%?
• effectiveness tests in animals, humans -- failure rate 90%?
Many pathways affect many diseases. Balance and mix?
2D to 3D conversion
Understanding actual interaction effects requires 3-D models
Protein folding
• Strand of DNA, template for RNA, becomes protein,
assumes a tight, 3-D shape
• The shape determines the attachment points to cells,
– nature does folding in a few nanoseconds
– computation based on finding minimum energy conformations
would take many years
– current research tries to break computation by recognizing
common substructure types: alpha-helixes, beta sheets, …
Hardest genomics research issue today
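The claim that an exhaustive minimum-energy search would take many years follows from simple counting. The sketch below is back-of-the-envelope arithmetic only, under loudly stated assumptions that are not from the slides: 3 possible states per residue and 10^12 conformations evaluated per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def conformations(residues, states_per_residue=3):
    """Number of conformations if every residue independently takes one of a few states."""
    return states_per_residue ** residues


def exhaustive_search_years(residues, states_per_residue=3, evaluations_per_second=1e12):
    """Years needed to evaluate every conformation once (assumed evaluation rate)."""
    seconds = conformations(residues, states_per_residue) / evaluations_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (20, 50, 100):
        print(n, "residues:", f"{exhaustive_search_years(n):.3g}", "years")
```

The growth is exponential in the chain length, which is why recognizing common substructures (alpha helices, beta sheets) to prune the search is attractive.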
Use of 3-D configuration
Match protein or derivative to cell surface
• Attachment points
– nooks and crannies: zinc fingers, sockets
– permeability of membranes / cell walls for certain proteins
• Tools
– Visualization
– Docking programs
– Computation of fit (minimal energy?)
• Research limited by lack of knowledge
– 3-D configuration of proteins and cells
– effect of in-vivo deformations
Identification
• Match patterns of two samples
– label amino acids with fluorescent markers
– does not require functional genomic knowledge
– PCR multiplies sample size
– Fluorescent activated cell sorters can separate cells,
Ex.: separate embryo cells from mother’s blood
by labeling with father’s genes and matching
– Familial ties, human migrations, ...
child that died in French prison was Louis XVII by tissue
comparison with current relatives
– Ancestry of species by creating hierarchical difference trees
uses “junk portions” of genome - functions no longer needed
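A hierarchical difference tree can be sketched by counting pairwise differences between aligned sequences and repeatedly joining the two closest groups. The sketch uses plain average-linkage clustering on Hamming distances, which is one simple way to build such a tree and not necessarily the method used in practice; the species names and toy sequences are made up.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def difference_tree(seqs):
    """Build a nested-tuple tree from a dict of name -> aligned sequence."""
    # Each cluster is (tree, member names); leaves are just the names.
    clusters = [(name, [name]) for name in sorted(seqs)]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(hamming(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    toy = {
        "human": "AAAAAAAA",
        "chimp": "AAAAAAAT",
        "mouse": "AATTAACC",
        "fly":   "TTGGCCGG",
    }
    print(difference_tree(toy))
```

Using regions that are no longer under selection helps because their differences accumulate roughly with time rather than with function.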
Multiple Representations
[Figure: multiple representations of bio-information and their sources (public, hospitals, corporate; many of them proprietary):
DNA strings (Genbank, 5 billion bytes); protein DB; chemical structures, 2D - 3D;
descriptions & statistics of disease/normal cases; family traces; bibliographic citations;
literature, text & structured (50 billion bytes of text covering Genbank); essential detail.]
Heterogeneity inhibits Integration
• An essential feature of science
– autonomy of fields
– differing granularity and scope of focus
– growth of fields requires new terms
• A feature of technological progress
– standards require stability -- not seen now in genomics
– yesterday’s innovations are today’s infrastructure
• Must be dealt with explicitly
– sharing, integration, and aggregation are essential
– large quantities of data require precision
• Precision is critical -- whenever we deal with 100 000’s of instances, even a 1%
false positive rate means following up on 1000 false leads.
When those leads are people we must be extra careful.
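The arithmetic behind the false-lead figure, and what it does to precision when true cases are rare, can be sketched as follows. The 1% rate and 100 000 instances come from the slide; the prevalence (0.1%) and perfect sensitivity used in the second call are assumptions chosen only for illustration.

```python
def false_leads(instances, false_positive_rate):
    """Expected number of false positives when every instance is tested."""
    return round(instances * false_positive_rate)


def precision(instances, prevalence, sensitivity, false_positive_rate):
    """Fraction of reported hits that are real, given how rare true cases are."""
    positives = instances * prevalence
    negatives = instances - positives
    true_hits = positives * sensitivity
    false_hits = negatives * false_positive_rate
    return true_hits / (true_hits + false_hits)


if __name__ == "__main__":
    print(false_leads(100_000, 0.01))
    # Assumed: 0.1% of instances are true cases and none of them is missed.
    print(f"{precision(100_000, 0.001, 1.0, 0.01):.3f}")
```

Under these assumed numbers fewer than one reported hit in ten is real, which is the practical meaning of "large quantities of data require precision".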
Clinical: Diagnosis
Diagnosis is more advanced than treatment
• Match patient tissue sample pattern to rich pattern
– VLSI technology used to place 10 000 known genes on a chip surface
– look for matches of expressed genes vs expectations in cells from
diseased tissue (skin for melanoma, …)
– can distinguish, say, cancers, that require specific treatment,
but are indistinguishable by pathologists
• Follow with
– traditional treatments, if any
– but earlier / more aggressive / more specific
– being careful (e.g., haemophilia)
– being emotionally more prepared
Clinical Treatment
Only a few choices now; they take many years to develop and test
Two ways to get good genes to work
• in vivo -- problem: rejection
• put virus (can penetrate cells) with repaired gene into cells
• those cells now generate proper protein
• expect cells to replicate, and create more protein
• in vitro -- problem: getting protein to right places
• use bacteria to replicate gene
• let them manufacture needed proteins
• inject proteins
Clinical Treatments 2
Or, block bad genes,
all in vivo -- problem: knowledge, getting there
• flood area with decoy promoters
– fool the transcription machinery, prevent transcription from DNA to RNA
• block RNA from being a template for protein
– use anti-sense molecules to create wrong double helix segments
• stifle cells by synthetic antibodies (for cancers)
– block growth factor attachment for its proteins, by providing fakes
Integration projects and topics
Meeting was intended to bring people together
• presentations present individual projects and concerns
can’t do anything else if you want to be real
Follow-up?
• Individuals
• Funding evaluation
• Specific interoperation projects?
[Notes that follow are a personal record, not a formal transcript.]
Public versus proprietary interests
• Government funding is much less than industrial funding
• Where is leverage for interoperation?
Database Annotation: adds meaning
[Chris Overton, U Penn]
Genome annotation
original/subsequent - know provenance
• Provides links to data sources and to encoded proteins.
• Predicts and archives landmarks.
The majority of Genbank entries have annotation ambiguity.
Poor advice of changes other than to sequence.
PDB does not list all binding sites found in proteins. Lack of motivation/confidence of authors? [Weissig, Bioinformatics 99]
Errors come from
• experimental data
• manual curation from the literature,
• computational predictions (Grail uses Neural nets)
• propagated increasingly in computation and integration.
K2 mediator project (GAIA DB) [Chris Overton, Univ. of Pennsylvania]
Uses Genbank, SwissProt, TRRD , GERD, TRANSFAC, MEDLINE.
(some have moderately or highly restrictive licenses)
Looks for syntactic errors, using a formal grammar for eukaryotic genes
matching introns and exons (implied in GD), also actual coding regions.
propagated spelling problems
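A syntactic check of gene annotations against a few structural rules might look like the sketch below. It is not the K2 grammar, which is not given in these notes; it only encodes textbook constraints (introns begin with GT and end with AG, the coding region starts with ATG, ends with a stop codon, and has a length divisible by 3). The function name and the annotation format (a list of half-open exon intervals) are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene(sequence, exons):
    """Return a list of rule violations for exons given as ordered (start, end) intervals."""
    errors = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 <= end1:
            errors.append(f"exons overlap or abut near position {end1}")
            continue
        intron = sequence[end1:start2]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            errors.append(f"intron {end1}-{start2} lacks GT...AG boundaries")
    coding = "".join(sequence[start:end] for start, end in exons)
    if not coding.startswith("ATG"):
        errors.append("coding region does not start with ATG")
    if len(coding) % 3 != 0:
        errors.append("coding length is not a multiple of 3")
    elif coding[-3:] not in STOP_CODONS:
        errors.append("coding region does not end with a stop codon")
    # Any stop codon before the final one would truncate the protein.
    inner = [coding[i:i + 3] for i in range(0, len(coding) - 3, 3)]
    if any(codon in STOP_CODONS for codon in inner):
        errors.append("in-frame stop codon before the end")
    return errors


if __name__ == "__main__":
    toy_sequence = "ATGGCC" + "GTAAAG" + "GGTTAA"   # exon, intron, exon (made-up)
    print(check_gene(toy_sequence, [(0, 6), (12, 18)]))
    print(check_gene(toy_sequence, [(0, 6), (11, 18)]))
```

Checks of this kind flag entries for a curator; they cannot say whether the sequence or the annotation is the part that is wrong.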
Database Correctness
[Bill Anderson, Knowledge Bus Inc, Hanover MD]
Develop methods for correcting errors: `Debabelization’
for EML (European Media Lab.) and EMBL (Heidelberg):
All their work (Data Alive) is based on an ontology for biochemical
databases
Syntactic errors: formats
Semantic errors: interpretation of relations -- ontology
Pragmatic errors: true data differences (experiment, transcription)
biochemical ontology --> microanatomy --> {spatial, events}, chemistry --> {spatial, events} --> (several 100 axioms as constraint rules)
(Either the database or the constraining ontology is wrong.)
When a fault occurs go back to pragmatics, no automatic curation.
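The idea of axioms acting as constraint rules over database records, with faults reported rather than repaired, can be sketched as follows. The three rules, the field names, and the sample records are invented for illustration; they are not taken from the Data Alive ontology.

```python
# Each rule is (description, predicate); a record violates the rule when the predicate is False.
RULES = [
    ("protein must have a sequence",
     lambda r: r["type"] != "protein" or bool(r.get("sequence"))),
    ("compartment must be a known cell location",
     lambda r: r.get("compartment") in {None, "cytosol", "nucleus", "membrane"}),
    ("molecular weight must be positive",
     lambda r: r.get("weight_kda") is None or r["weight_kda"] > 0),
]


def find_faults(records):
    """Report (record id, violated rule) pairs; nothing is corrected automatically."""
    faults = []
    for record in records:
        for description, holds in RULES:
            if not holds(record):
                faults.append((record["id"], description))
    return faults


if __name__ == "__main__":
    records = [
        {"id": "P1", "type": "protein", "sequence": "MA",
         "compartment": "cytosol", "weight_kda": 12.5},
        {"id": "P2", "type": "protein", "sequence": "",
         "compartment": "golgi?", "weight_kda": -1},
    ]
    for fault in find_faults(records):
        print(fault)
```

A reported fault means either the record or the rule is wrong, so the output is a work list for a human rather than a set of fixes.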
Database Curation
[Michael Cherry, SGD database, genome-www.stanford.edu]
[Michael Ashburner for fly, mouse, and yeast (Saccharomyces)]
Long-term quality control:
Curation is the act of establishing and maintaining a database, here
often the chromosome or species-specific databases.
Similar task to what a journal editor does,
Curator also functions as an Educator, Ontologist [Yahoo term]
Learn what aids the community needs, and build the museum to satisfy
those needs [John Cotton Dana, 1850]
Set limits according to what you can do and obtain.
Find missing details in literature, include summary paragraphs.
Requires a gene ontology for molecular function,
Information on cellular location (absolute or relative),
Used for annotating results from microarrays.
Organizing Genomic Data
[Jim Garrels, Proteome, Inc. www.proteome.com]
Literature 50 billion bytes of text covering the 5 billion bytes in Genbank.
BioKnowledge Library curated by experts -- Proteome DB is free
• title with brief functional description,
• family
• properties (mutant phenotype, ...)
• sequence annotations,
• related proteins: Orthologs and Interlogs (in different species)
• classification
• integrated from cDNA microarrays and chips, systematic 2-hybrids, … .
Model-organisms: Started with Yeast, now worms [Stuart Kim, Stanford],
Several 1000 physical associations and interactions.
Authors should not publish experimental data directly into a DB and curate their own
papers, but submit their results and expression studies?
How to deal with updates of their own results? resubmit?
Need mediating portal sites as well as content sites.
Relate Genes to what is happening
[Dong-Guk Shin, Univ. Connecticut shin@engr.uconn.edu]
Virtual Cell Project: Cell Physiology modeling,
NIH supported:
also available without DB support, from www.nrcam.uchc.edu
Identify gene function (i.e., protein generation) in cells
Bottom-up approach to cell modeling
Cross checking of models and Hypotheses
Geometry obtained from segmented images
2-D Visualization of specified reactions:
channels, pumps, for extra, intra (cytosol), in core cellular compartments.
Generates equations for simulation.
Result is a DB publication cycle, supporting model copying & adaptation.
For access, remote users need more than a browser: they also need a query
system, with join over association.
DBs need APIs and mediation for scalability and mismatch.
4.5 Aspects for Interoperability of Databases
[Daniel Gardner, Cornell University. cortex.med.cornell.edu]
1. User - platforms, software, open to new data: model journal to define scope
and views, but include data - re-analyzable. Data quality is domain-dependent.
Data sets presented via a virtual oscilloscope.
2. Common data model (XML based, with capability for interdomain queries) for
neuroscience.
– Hierarchical with a controlled vocabulary, for selected granularity.
– Much metadata (physiological site, data, reference, method and
model elements), used in query terms as well.
– Data compatibility - federated, and evolving.
3. Temporal - legacy, current, future (IBM card -- XML)
4. Technical - proprietary versus open (as PNAS papers)
4.5 Domain specific versus interdisciplinary. Just interfaces.
XML BDML for brains. Will be longer lived than CORBA.
<<The problem of interoperation is not the syntax of XML, but the semantics of the DTD
tags. Scalability beyond neurosciences. Federation versus articulation.>>
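The point that XML syntax is the easy part and tag semantics the hard part can be illustrated with two well-formed documents that parse identically but use different tag names and units. The tags, units, and mapping table below are hypothetical; the sketch uses Python's standard `xml.etree.ElementTree` parser and a hand-written articulation table between the two vocabularies.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources: both are valid XML, but they disagree on names and units.
SOURCE_A = "<recording><site>CA1</site><rate unit='Hz'>40</rate></recording>"
SOURCE_B = "<recording><location>CA1</location><rate unit='kHz'>0.04</rate></recording>"

# Articulation knowledge that no parser can supply: which tags mean the same thing.
TAG_MAP = {"site": "site", "location": "site", "rate": "rate_hz"}
UNIT_TO_HZ = {"Hz": 1.0, "kHz": 1000.0}


def normalize(xml_text):
    """Map one source record onto the shared vocabulary, converting rates to Hz."""
    record = {}
    for element in ET.fromstring(xml_text):
        name = TAG_MAP[element.tag]
        if element.tag == "rate":
            record[name] = float(element.text) * UNIT_TO_HZ[element.get("unit")]
        else:
            record[name] = element.text
    return record


if __name__ == "__main__":
    print(normalize(SOURCE_A))
    print(normalize(SOURCE_B))
```

Parsing succeeds for both sources without any shared agreement; the mapping table is where the domain knowledge lives, and it has to be maintained as the sources evolve.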
Databases are supplanting journals, but …
[Peter Karp: www.ai.sri.com/pkarp/mmdb/94/]
Progress in interoperation
Databases are re-analyzable, important for validation, extension.
Results published in journals are not.
Estimate now about 500 public databases for Bioinformatics.
Want seamless interoperation.
Problems:
Differing models, some are just irregular flat files
Various units of measurements, leading to semantic errors.
Much text (SRS) vs. Structured information
Not all have APIs, nor web APIs
DBMS lack ontologies, no formal model,
inconsistent semantics (example even in Genbank entries),
often don’t have the right fields (SwissProt: inferred versus
observed).
Maintenance poor over time.
Warehouse versus multi-databases?
E. coli Metabolic Databases: EcoCyc
[Peter Karp, SRI Int., Bioinformatics Res. Group, pkarp@ai.sri.com]
The EcoCyc database contains 150 metabolic pathways known in E. coli,
with cross references to Genbank, literature, evaluation
Not only relevant for E. coli:
These pathways are also found in other species
locatable by gene matching
also HinCyc for the bacterium H. influenzae (proprietary?)
To provide cross-linking used by other (mediation) projects
K2 at U Penn,
OPM by Gene Logic,
Hyperlinking at SB-Glaxo.
Proposed XOL= ontology exchange language.
Flybase
[ William Gelbart, Harvard University]
Flybase collects more than just the fruitfly gene sequence,
namely exons and their mutations. Transposon insertion sites.
Moving from being Hunter Gatherers in science to Harvesters, moving to
an agronomical society, << requires new laws >>
Phenome <--> complexome --> Genome <-- transcriptome -->> Proteome.
Classical genomics is being superseded by Expression and Interaction of
gene products and gene perturbation <-- --> phenotypes.
How do we organize DBs for that objective?
Many sorting methods
Things {biological objects, relationships among the objects -- with
sources } -> robust object classifiers with controlled vocabularies.
Foundation DBs vs Derived DBs -- define ownership of source DBs.
• Histories must be maintained.
• Version tracking.
• Presentation standards.
Electronic Publication
[Brian Ray, American Assoc.for the Advancement of Science]
Most journals require submission of gene sequences into <>
Papers only summarize, and describe process
But what will journals look like now?
The Signal Transduction Knowledge Environment: www.stke.org
STKE: Virtual journal, developed jointly with HighWire Press: Using
the web for summarizing relevant articles from other (electronic)
journals
A prototype for a future publication model: all academic papers are
placed into a pile, and classified into one or more discipline
categories, and aggregated and retrieved by secondary specialists
- a new role for editors, requiring scientific competence and
authority. Maintains a pathway map for attaching. Has a controlled
vocabulary. Does caching of retrieved referenced Medline articles.
Larger-scale units of Biological Data
[Stephen Koslow, Office on Neuroinformatics, NIMH]
The human brain has
• 100 billion (10^11) neural cells, dozens of cell types.
• 10^15 connections.
• uses 15 Watts.
Voluminous 3-D MRI data, at higher granularity.
Basis for localization of diagnostic EEG, MEG observations.
Neuroscience is a growing field, includes neuroinformatics.
Has initial, broad journals, reductionist journals,
Numerical, symbolic, literature and image data.
Volume of publication only for serotonin, discovered in 1948,
now 70 000 papers, is becoming impossible to follow.
See UCLA brain mapping project for basis data -- normal brain.
[www.nimh.nih.gov/neuroinformatics/index.cfg]
Modeling & Simulation
[James Bower, California Institute of Technology]
Modeling and simulation of Purkinje cells
Purkinje cell (6 M in human), 100 micrometers, has 250 000 inputs, 10-12 distinct
conductances, modeled by Erik De Schutter [now Belgium].
Tested with electrical probes.
• Found differences with published information: here the dendrite is a current sink.
– Rethinking of cerebellum.
– It is a sensory device, not a motor control device.
– Shown by experiments on motor and sensing, and observing brain activity.
• Still, linking images and actual activity of neurons in that area is hard.
• Cognitive - system - network - cellular - subcellular - molecular - atomic.
Web site, Purkinje Park, allows ongoing collaboration with students,
www.whyville.net/index.html - kids learning relationships among levels
Corresponding simulators:
ACT, SOAR (connects 2 levels); GENESIS (4 levels); NEURON (2); MCELL/VCELL (2);
RasMol/WebLab; GEPASI/GAMESS/Psl.
Computer science neural nets are very simplistic.
Analytical Approaches
[Douglas Brutlag, Stanford University]
Applications of Data Mining,
Many types of relevant DBs for sequence and sequence variation
information
Now also
• relationship DBs (phylogenetic),
• gene fusion [Eisenberg],
• pathways,
• gene expression,
• protein-ligand,
• signal transduction
Challenge: finding them, syntax, semantics (MeSH inadequate).
Doubletwist [Pangea] - agent-based, domain-specific journal summaries and notifications of subsequently published findings.
Conclusion:
Data, and Models to represent understanding of data
Sharing and Publishing electronically at two levels
1. Sources, i.e.: data -- with provenance - incl. predictions, fixes.
• recognize owners’ objectives - they may not be your objectives
(PDB does not list all binding sites found - lack of motivation)
2. Models, incorporating knowledge, with means to populate the model
3. Added value by secondary processing - shared ownership (c)
• Expanding on Prof. Gelbart’s example by moving from the agronomic society to
the medieval guilds -- the predecessors of professional societies --
sitting around the market square, where the farmers deliver their
source, as wholesalers and intermediaries.
• Well maintained derived databases also have value:
added value by expertise focused on some objective.
Integration summary
• A focus of Knowledge generation is integration of data
• The problem of interoperation is not the syntax of XML, but the
semantics of the DTD tags.
– Scalability beyond neurosciences.
– Federation versus articulation.
– XML debabelizer.
• Yes, keep the fundamental sources, but get added value in derived
data (as SwissProt):
– error correction for a specific objective (U Penn work),
– adding entries
– Does not require federation and terminological alignment of all sources.
• Rules and ontologies provide incremental help.
– help much, but don’t solve problems of semantic errors
The People Problem
The demand for people in bioinformatics is high
at all levels
• Critical is a lack of
– training opportunities - programs and teachers
– available trainees
• Being in a multi-disciplinary field is scary
– tenure for faculty
– load for students
– salary and growth differentials in biology and CS
• Some institutions are moving aggressively
• [Caltech, U Penn, EPFL?]
– must compete with World-Wide Web visions
Bioinformatics needs Ethics
Knowledge carries responsibilities.
also, always some error rates
How will people feel about your knowledge about them?
their genetic make-up,
physical & psychological propensities.
Privacy is hard to formalize,
but that does not mean it is not real to people.
Perceptions count.
(There is also real stuff: insurance scams, personal relations.)
Diagnostics without therapies.