Bioinformatics: Converting Data to Knowledge Gio Wiederhold Stanford University

TITech 13 Nov 2000
Bioinformatics:
Converting Data to Knowledge
Gio Wiederhold
Stanford University
Computer Science, E.E. & Medicine
http://www-db.stanford.edu/people/gio.html

[Figure: from Data to Knowledge. Labels: Observations, Filters, Aggregation of instances, Integration of sources, Analyses]
• The product: Information
7/26/2016
Gio Wiederhold - TITech 2000
2
Bio-Information
• to learn about ourselves,
– our origins, our place in the world
Primates, Mice, Zebrafish, Fruit Flies, Roundworms, Yeast
– modesty, seeing how much we share with all organisms
– not just of philosophical interest, but also
• to help humanity to lead healthy lives
– to create new scientific methods
– to create new diagnostics
– to create new therapeutics
Loops of Data and Knowledge
Information is created at the confluence of
data -- the state
&
knowledge -- the ability to select and project the state into the future
[Figure: the Knowledge Loop and the Data Loop. Labels: Storage, Education, Selection, Recording, Integration, Abstraction, Experience, State changes, Decision-making, Action]
Volume and Variety
Two interacting issues in generating
information
1. The volume is large: we need automation
2. The data is varied & heterogeneous
• many autonomous sources
• many distinct objectives
many incompatibilities, errors
Quantities
[Figure labels: Nature, Progress]
• 1 human. The human genome: ~ 4 000 000 000 base pairs
• > 30 000 genes. Genes, and gene abnormalities
• ~ 10 000 proteins, diseases
• 6 000 000 000 humans. Everybody’s genes
• < 1000 systems. Metabolic pathways
• ~ 2 000 000 molecules. Small organic molecules - affect proteins - suitable for drugs
Diversity → Heterogeneity
A wide variety of knowledge is needed to interpret the data
A large variety of experts is developing this knowledge
The scope of interests differs among those experts
The knowledge is expressed in diverse ways
The terms differ in precise meaning: semantics
A large variety of data types is needed
A wide variety of representations is used
The database and file schemas differ
The openness and accessibility of the information differ
Scope differences
A scope difference exists when terms differ in
their mapping to real-world objects
employee (payroll)
disabled
employee(personnel)
all possible employees
contractors
The local objective determines scope
Example: “binding site” in PDB database [Waugh&Altman]
binding sites reported for publication
doubtful
all actual binding sites
reporting doubtful results risks rejection of publication
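A scope difference can be made concrete by treating each source's use of a term as a set of real-world objects. The sketch below is illustrative only: the variable names and the sample individuals are assumptions, as is the guess that payroll covers contractors while personnel covers disabled staff.

```python
# Hypothetical data: the same term "employee" as scoped by two departments.
payroll_employees = {"ann", "bob", "carol_contractor"}   # assumed: everyone who is paid
personnel_employees = {"ann", "bob", "dave_disabled"}    # assumed: everyone on the roster


def scope_report(first, second):
    """Split two scopes of one term into shared and source-specific objects."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


report = scope_report(payroll_employees, personnel_employees)
print(report)
```

Any object outside `shared` is a potential mismatch when the two sources are joined on the term "employee".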
Heterogeneity inhibits Integration
• An essential feature of science
– autonomy of fields
– differing granularity and scope of focus
– growth of fields requires new terms
• A feature of technological process
– standards require stability
– yesterday’s innovations are today’s infrastructure
Must be dealt with explicitly
– sharing, integration, and aggregation are essential
– large quantities of data require precision
Heterogeneity among domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems ✓✓
• Data Representation and Access Conventions ✓
• Metadata: Annotations, Naming, and Ontology
– needed to share data from distinct sources
Required precision = F(volume)
More precision is needed as data volume increases
--- a small error rate still leads to too many errors
False Positives have to be investigated
False Negatives cause lost opportunities, suboptimal to some degree
( attractive-looking supplier - makes toys
apparent drug-target with poor annotation )
[Figure: data errors versus information quantity, with an Information Wall; adapted from Warren Powell, Princeton Un.]
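The arithmetic behind "Required precision = F(volume)" can be sketched as follows. The function names, the 0.1% error rate, and the budget of 10 tolerable errors are illustrative assumptions, not figures from the talk.

```python
def expected_errors(records: int, error_rate: float) -> float:
    """Expected number of erroneous results at a fixed error rate."""
    return records * error_rate


def max_error_rate(records: int, tolerable_errors: float) -> float:
    """Error rate needed to stay within a fixed budget of errors."""
    return tolerable_errors / records


# Same assumed 0.1% error rate, growing volume: the error count grows linearly.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, expected_errors(n, 0.001), max_error_rate(n, 10))
```

With a fixed capacity to investigate errors, the tolerable error rate shrinks in proportion to the volume.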
Inconsistency causes errors,
while results need precision
False positives = poor precision
typically cost more than
false negatives = poor recall
Example: [ Todd Lowe, tRNA search <rna.wustl.edu/tRDB> ]
Search in Yeast for 55 methylation sites
-- required manual elimination of pseudogenes
Search space in human genome is 215 times larger, not yet done
In drug discovery we now have more targets than
pharmaceutical companies can afford to investigate
Broad array of relatable sources
• Genomic
• Bibliographic
• Demographic
• Epidemiological
– Familial
– Contacts
• Clinical
– Drug effectiveness
– Drug-resistance
– Co-occurrence
[ Many used in data-mining, as in PRM (Probabilistic Relational Model) research by Lise Getoor @ Stanford ]
Requires acyclicity. Use temporal dependencies?
Intersection of a large (irrelevant data)
and a small (good data) distribution.
Result: the optimal separation creates more
false positives (irrelevant results)
than false negatives (good results missed)
Quality of data verified through publication
Data characteristics project
[Stephen Koslow, Office on Neuroinformatics, NIMH
www.nimh.nih.gov/neuroinformatics/index.cfg]
The human brain uses 15 Watts; has dozens of cell types,
100 billion (10^11) neural cells, 10^15 connections.
Neuroscience is a growing field, includes neuroinformatics.
Initial, broad journals; reductionist journals. Numerical,
symbolic, literature and image data. The volume of publication
on serotonin alone, discovered in 1948, is now 70 000 papers and is
becoming impossible to follow.
Voluminous 3-D MRI data. UCLA brain mapping. Basis for
localization of diagnostic EEG, MEG observations.
Projects requiring manual curation
are domain specific
Virtual Cell Project
Dong-Guk Shin, Univ. Connecticut shin@engr.uconn.edu
also available without DB support, www.nrcam.uchc.edu
NIH supported: Physiology modeling,
NSF support: computational modeling approach.
Bottom-up approach to cell modeling: Cross checking of models and
HXs: Geometry from segmented images, 2-D visualization of specified
reactions: channels, pumps, for extra, intra (cytosol), of core cellular
compartments. Generates equations for simulation.
Result is a DB publication cycle, supporting model copying and
adaptation.
Access to remote DBs will need more than a browser: it also needs a
query system, with joins over associations. DBs need APIs and
mediation for scalability and mismatch.
Data integration in Literature
[ Jim Garrels, Proteome, Inc. www.proteome.cm - free ]
BioKnowledge Library, a portal site with 50 billion bytes of text
covering the 5 billion bytes in Genbank.
Classification, curated by experts.
Pages {title with brief functional description, family, properties (Mutant
phenotype, ) , } sequence annotations, related proteins: Orthologs
and Interologs (in different species) [Marc Vidal, MGH],
Integrated data from cDNA microarrays and chips, systematic 2-hybrids,
Model-organisms: First Yeast, now Worms [Stuart Kim, Stanford],
Several 1000 physical associations and interactions.
Authors should not publish experimental data directly into a DB and
curate their own papers, but submit their results and publish detailed
expression studies and update their own results.
Relationships among search parameters
[Figure: recall r and precision p, on a 0% to 100% scale, over the space of methods, ranked from best; perfect recall is at 100%]
r = v.relevant / v.available
p = v.relevant / v.retrieved
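The recall and precision ratios can be computed directly from sets of item identifiers. This is a minimal sketch, assuming that "v.relevant" in the numerator means the relevant items actually retrieved; the function name and the sample identifiers are illustrative.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one search result.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items available
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 items retrieved, 3 relevant items exist, 2 overlap.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(f"precision={p:.2f} recall={r:.2f}")
```

False positives lower precision (here `d3` and `d4`); false negatives lower recall (here `d5`).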
Means to achieve precision in text
Textual information - knowledge - complements
pure data-oriented searches such as BLAST [Liu & Altman]
• Reduce redundancy
– omit similar results from alternate sources
reports, workshop papers, journals, books
• Reduce false positives
– recognize contextual domains *
• the same word refers to different object types
nail (carpentry, anatomy), miter (carpentry, religion)
• Abstract findings to higher levels
– Linguistic processing based on customer model
medical case studies have similar formats
Integration makes Semantic Mismatches visible
Information comes from many autonomous sources
• Differing viewpoints (by source)
– differing terms for similar items: { lorry, truck }
– same terms for dissimilar items: trunk ( luggage, car )
– differing coverage: vehicles ( DMV, police, AIA )
– differing granularity: trucks ( shipper, manuf. )
– different scope: student ( museum fee, Stanford )
• Hinders use of information from disjoint sources
– missed linkages: loss of information, opportunities
– irrelevant linkages: overload on user or application program
• Poor precision for interoperation
– ok for web browsing, poor for business and science
Shared Knowledge Base
PharmGKB – PharmacoGenetics Knowledge Base starting 2000
“An Ontology for Genetic Information” [Russ Altman]
<pharmgkb.org> based at Stanford, funded by NIGMS
to link existing projects – but open to others.
Phenotype variation --> Genotype variation
• Phase 2 metabolizing enzymes -- R. Weinshilboum at Mayo Clinic
• Asthma -- Weiss (was Jeff Raizin) at Harvard Un.
• Anti-cancer agents -- Mark Ratain at Un. of Chicago
• Membrane Transporters -- Kathleen Giacomini, UCSF
• Tamoxifen metabolic activation -- Dave Flockhart at Georgetown Un.
• Minority Populations and Privacy -- M. Rothstein at Univ. of Houston
• Depression in Mexican-Americans -- J. Licinio at UCLA
• Database Tools -- Prakash Nadkarni at Yale Un.
Complex Relationships
[Figure: network of relationships among Genomic information, Isolated functional measures, Pharma. activity, Drug response systems, Clinical phenotype, Physiology, Coding, Molecules, Molecular & cellular phenotype, Integrated functional measures, Protein Products, Observable phenotypes, Genetic Makeup, Alleles, Molecular Variation, Drugs, Individuals, Non-genetic factors, Environment]
courtesy of R. Altman & Teri Klein, PharmGKB
PharmGKB
• Ontology for pharmacogenetics
– Represented in Protégé
[Musen: smi.stanford.edu/project/protege]
• Service for Universities and Industry
• open access to information and tools, but not a warehouse
– Industrial affiliates: contributors and consumers at larger scales:
• geneticXchange
• Merck Co
• Pharmacia
• SmithKline-Beecham ( & Glaxo-Wellcome )
GeneLogic, Guidant, Doubletwist, Incyte, Informax, SGI, Sun
• Collaboration in larger topics:
– Biotechnology -- Clark Center
– Education -- NIH sponsored training program, new UG degrees
Consistency: global or partial ?
• Global consistency
+ wonderful for users and their programs
– too many interacting sources
– long time to achieve, 2 sources (UAL, LH), 3 (+ trucks), 4, … all ?
– costly maintenance, since all sources evolve
– no world-wide authority to dictate conformance
• Domain-specific ontologies (XML DTD assumption)
+ small, focused, cooperating groups
+ high quality, some examples - arthritis, Shakespeare plays
+ allows sharable, formal tools
+ ongoing, local maintenance affecting users - annual updates
– poor interoperation, users still face inter-domain mismatches
– periodic source updates need automation in interoperation
Stanford Infolab SKC project
( Scalable Knowledge Composition )
Objective: High precision in semantic
interoperation of autonomous sources
• Basic -- pessimistic -- assumption:
– The ontological mapping of terms → objects
differs between autonomous domains.
• But
– The collections of real-world objects provide a
grounding for the definitions, and an
opportunity to validate the meaning of the
terms being employed.
– Relationships have semantic and a related
structural significance.
Exploit Domain-specific Expertise
Knowledge needed is huge
in science and in business
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Provide tools
Consider interaction
[Figure: three societies of specialists]
SKC grounded definition
• Ontology:
a set of terms and their relationships
• Term:
a reference to real-world and abstract objects
• Relationship:
a named and typed set of links between objects
• Reference:
a label that names objects
• Abstract object:
a concept which refers to other objects
• Real-world object:
an entity instance with a physical manifestation
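The grounded definitions can be sketched as a small data model. This is an illustration only: the class and field names are assumptions, and the SKC project's actual representation is not given in these slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Obj:
    """Something a term can refer to."""
    name: str
    real_world: bool  # True: entity instance with a physical manifestation; False: abstract concept


@dataclass(frozen=True)
class Term:
    label: str            # the reference: a label that names objects
    refers_to: frozenset  # the Obj instances the label names


@dataclass(frozen=True)
class Relationship:
    name: str         # relationships are named ...
    kind: str         # ... and typed
    links: frozenset  # set of (Obj, Obj) pairs


@dataclass
class Ontology:
    """A set of terms and their relationships."""
    terms: set = field(default_factory=set)
    relationships: set = field(default_factory=set)


truck = Obj("truck #17", real_world=True)   # hypothetical real-world object
vehicle = Obj("vehicle", real_world=False)  # hypothetical abstract object
onto = Ontology()
onto.terms.add(Term("lorry", frozenset({truck})))
onto.relationships.add(
    Relationship("instance-of", "classification", frozenset({(truck, vehicle)}))
)
print(onto)
```

Keeping the referenced objects on each term is what allows the meaning of a label to be validated against the real-world collection.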
Sample Operation: INTERSECTION
Result contains shared terms, useful for purchasing
Articulation
Source Domain 1: Owned and maintained by Store
Source Domain 2: Owned and maintained by Factory
An Ontology Algebra
A knowledge-based algebra for ontologies
Intersection: create a subset ontology, keep sharable entries
Union: create a joint ontology, merge entries
Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of
matching rules that link domain ontologies
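The three operations can be sketched over very simple structures. This is a toy model, not the SKC implementation: ontologies are reduced to sets of term labels, articulation rules to pairs of matching labels, and the store and factory terms are invented for illustration.

```python
# Hypothetical source ontologies and articulation (matching) rules.
store = {"shoe", "price", "customer"}
factory = {"footwear", "cost", "machine"}
rules = {("shoe", "footwear"), ("price", "cost")}


def intersection(o1, o2, rules):
    """Keep sharable entries: term pairs linked by a matching rule."""
    return {(a, b) for (a, b) in rules if a in o1 and b in o2}


def difference(o1, o2, rules):
    """Remove shared entries: terms of o1 with no match in o2."""
    matched = {a for (a, _) in intersection(o1, o2, rules)}
    return o1 - matched


def union(o1, o2, rules):
    """Merge entries: matched pairs become one entry, the rest are kept."""
    merged = {frozenset(pair) for pair in intersection(o1, o2, rules)}
    merged |= {frozenset([t]) for t in difference(o1, o2, rules)}
    flipped = {(b, a) for (a, b) in rules}
    merged |= {frozenset([t]) for t in difference(o2, o1, flipped)}
    return merged


print(intersection(store, factory, rules))
print(difference(store, factory, rules))
print(union(store, factory, rules))
```

Because each operation takes ontologies and rules and returns a new structure, results can be fed into further operations, which is the property the algebra relies on.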
INTERSECTION support
Articulation ontology: terms useful for purchasing
Matching rules that use terms from the 2 source domains
[Figure: the articulation ontology sits above the Store Ontology and the Factory Ontology]
Other Basic Operations
UNION: merging entire ontologies
DIFFERENCE: material fully under local control
Articulation ontology: typically prior intersections
Tools to create articulations
Graph matcher for Articulation-creating Expert
[Figure: the matcher compares the Transport ontology and the Vehicle ontology and produces suggestions for articulations]
continue from initial point
Also suggest similar terms
for further articulation:
• by spelling similarity,
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant
to this articulation
All results are recorded
Okays are converted into articulation rules
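The suggest-and-review cycle can be sketched as below. Only the spelling-similarity heuristic is shown, using Python's standard `difflib`; graph position and the match nexus are not modelled. The function names, the sample terms, and the stand-in expert are assumptions for illustration.

```python
import difflib


def suggest(term, candidates, cutoff=0.6):
    """Suggest candidate terms whose spelling is similar to `term`."""
    return difflib.get_close_matches(term, candidates, n=3, cutoff=cutoff)


def review(pairs, expert):
    """Record every expert verdict; turn each 'okay' into an articulation rule."""
    log, rules = [], []
    for a, b in pairs:
        verdict = expert(a, b)  # expected: "okay", "false", or "irrelevant"
        log.append((a, b, verdict))
        if verdict == "okay":
            rules.append((a, b))
    return log, rules


# Hypothetical terms from a transport ontology and a vehicle ontology.
vehicle_terms = ["trucks", "car", "carriage"]
pairs = [("truck", match) for match in suggest("truck", vehicle_terms)]

# Stand-in for the human expert: accepts plural/singular variants only.
log, rules = review(pairs, lambda a, b: "okay" if b == a + "s" else "false")
print(log)
print(rules)
```

Keeping the full log, including rejected suggestions, avoids asking the expert the same question again when the sources change.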
Candidate Match Nexus
Term linkages automatically extracted from 1912 Webster’s dictionary *
Notice presence of 2 domains: chemistry, transport
Based on processing headwords and definitions
using algebra primitives
* free; have processed the OED (Oxford English Dictionary)
at Stanford for internal use
Using the Match Nexus
Experiment:
On government structures of
NATO countries:
SKEIN system resolved
over 70% of unmatched terms
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be
kept and reused when sources change
Knowledge Composition
[Figure: knowledge resources A, B, C, D and E, linked by articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D) and (C ∩ E)]
Composed knowledge for applications using A, B, C, E:
(A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E)
Legend: ∪ : union, ∩ : intersection
Support Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously
maintainable
Empowerment
* based on experience with software
Summary Scalable Knowledge Composition
Provide for Maintainable Ontologies
• devolve maintenance onto many
domain-specific experts / authorities
• provide an algebra to compute
composed ontologies that are
limited to their articulation terms
SKC
• enable interpretation within the
source contexts
Many Other Tasks
at/near Stanford
Matching cell / protein 3D with chemical’s 3D
• Regulatory Gene motifs :
– Bioprospector [ Brutlag & Liu <www-cmgm.stanford.edu> ]
• Protein structure generation
– moving from small to larger proteins
1: Powerful parallel processing [IBM BlueGene]
2: Two-level : use features as an intermediate
(alpha-helix, beta-sheets, …)
3: Protein Folding speedup by delegation
[Shirts & Pande: foldingathome.stanford.edu ]
• RNA folding (simpler, larger) [Nakatani & Pande]
Provenance of derived data
Assure having a proper history of derived results
[ Peter Buneman, UPenn, www.humgen.upenn.edu ] K2 integration tool
Integrated databases often don’t indicate the original sources
E.g., SwissProt does not distinguish inferred from observed data.
[ William Gelbart, Harvard University] Flybase
Flybase also collects data such as exons and their mutations, and transposon insertion sites.
Moving from being Hunter Gatherers in science to Harvesters, moving to an
agronomical society
Classical genomics is being superseded by expression and interaction of gene products
and gene perturbation.
[ Peter Karp, SRI Int., Bioinformatics Res.Group, www.ai.sri.com/pkarp/ ] EcoCyc
EcoCyc links proteins to 150 metabolic pathways in E. coli
Databases are supplanting journals. They are re-analyzable. Results in journals are not.
An estimated 500 public databases now exist for bioinformatics. Not all
of them have APIs or use real DBMSs, and they differ in models and units of
measurement, leading to semantic problems.
The People Problem
The demand for people in bioinformatics is high,
at all levels
• Critical is a lack of
– training opportunities - programs and teachers
– available trainees
• Being in a multi-disciplinary field is scary
– tenure for faculty
– load for students
– salary and growth differentials in biology and CS
• Some institutions are moving aggressively
– must compete with World-Wide Web visions
Bioinformatics:
Converting Data to Knowledge
• The means: People
• The product: Information
Up-to-dateness
[Figure: percentage up-to-date (0% to 100%) versus frequency of visits, with one curve per frequency of source change: never, 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, 1/second]
Frequency of visits = F(user need); depends on effort, methods
Visiting as often as possible = F(capability given 2.2M public sites with 288M pages), Feb. 2000
Privacy requires Ethics
Knowledge carries responsibilities.
How will people feel about your knowledge about them?
their genetic make-up,
physical & psychological propensities.
Privacy is hard to formalize,
but that does not mean it is not real to people.
Perceptions count.
(There is also real stuff: insurance scams, personal relations)
Diagnostics without therapies.
Securing Collaboration
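[Figure (TIHI, Oct 96): a Collaborator sends a source query to the Security Filter, which passes a certified query to the Private Patient Data; the unfiltered result returns through the Security Filter, which releases a certified result to the Collaborator and writes Logs]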
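The query and result path through a security filter can be sketched as a small mediator. This is an assumption-laden illustration, not the TIHI design: the class name, the field-whitelist policy, and the sample patient rows are all hypothetical.

```python
class SecurityFilter:
    """Mediates between a collaborator and a private data source."""

    def __init__(self, source, allowed_fields):
        self.source = source                # callable: list of fields -> list of row dicts
        self.allowed = set(allowed_fields)  # hypothetical release policy
        self.log = []

    def certify_query(self, fields):
        """Reject queries that ask for fields outside the policy."""
        rejected = sorted(set(fields) - self.allowed)
        self.log.append(("query", sorted(fields), rejected))
        if rejected:
            raise PermissionError(f"fields not releasable: {rejected}")
        return list(fields)

    def certify_result(self, rows):
        """Strip any field the source returned beyond the policy."""
        return [{k: v for k, v in row.items() if k in self.allowed} for row in rows]

    def ask(self, fields):
        certified_query = self.certify_query(fields)
        unfiltered = self.source(certified_query)
        result = self.certify_result(unfiltered)
        self.log.append(("result", len(result)))
        return result


def patient_db(fields):
    """Hypothetical private source that over-returns: it always includes the name."""
    rows = [{"name": "J. Doe", "diagnosis": "asthma", "drug_response": "good"}]
    return [{k: row[k] for k in set(fields) | {"name"}} for row in rows]


tihi = SecurityFilter(patient_db, allowed_fields={"diagnosis", "drug_response"})
print(tihi.ask(["diagnosis"]))
print(tihi.log)
```

Filtering both directions matters: the query check stops improper requests, while the result check catches data the source released beyond what was asked.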