14Segment - Ganesha Associates

Basic reading, writing and
informatics skills for biomedical
research
Segment 4. Other types of
database and browser
10 April 2008
Copyright: Ganesha Associates
2008
Biological databases
• A database is an indexed collection of information
• Some databases contain mainly text, but others contain
sequence or structural data
• A browser is a means of visualising this information and
the relationships between data elements
• There is a growing amount of information in publicly
available databases.
• Each year, the journal Nucleic Acids Research publishes
a database issue. The 2007 issue lists 968
editorially selected biomolecular databases
The database problem
• Volume of data (both high throughput and text)
• Complexity
• Distributed systems and databases
• Incompatible data formats
• Multi-disciplinary
• Multi-lingual
• Inability to share knowledge
• Ambiguity of terminology
The problem – biomedical research
[Figure: diagram of interlinked databases and data types, including Gene Expression Warehouse, OMIM, Disease, ExPASy SwissProt, PDB, ExPASy Enzyme, Protein, Enzyme, LocusLink, Affy Fragment, Known Gene, MGD, Sequence, Metabolite, SNP, SPAD, Sequence Cluster, NCBI dbSNP, Genbank, NMR, Pathway, UniGene and KEGG]
The question – biomedical research
The problem – biomedical research
The problem - pharmabiotech
The problem - healthcare
• 17-year innovation adoption curve from discovery to
accepted standards of practice
• Even if a standard is accepted, patients have a 50:50
chance of receiving appropriate care and a 5-10%
probability of incurring a preventable, anticipatable
adverse event
• Medical literature doubling every 19 years
– Doubles every 22 months for AIDS care
• 2 million facts needed to practice
• Genomics, Personalized Medicine will increase the
problem exponentially
• Typical drug order today with decision support accounts
for, at best, Age, Weight, Height, Labs, Other Active
Meds, Allergies, Diagnoses
The problem - healthcare
Journal of the American Medical Association (JAMA), Vol 284, No 4,
26 July 2000
• 12,000 deaths/year from unnecessary surgery
• 7,000 deaths/year from medication errors in hospitals
• 20,000 deaths/year from other errors in hospitals
• 80,000 deaths/year from infections in hospitals
• 106,000 deaths/year from non-error, adverse effects of medications
These total 225,000 deaths per year in the US from iatrogenic
causes, which ranks them as the third leading cause of death.
Iatrogenic is the term used when a patient dies as a direct result of
treatment by a physician, whether from misdiagnosis of the ailment
or from adverse reactions to the drugs used to treat it (adverse drug
reactions are the most common cause).
How do we find things in databases?
• Search engine indexing collects, parses,
and stores data to facilitate fast and
accurate information retrieval.
• Index design incorporates interdisciplinary
concepts from linguistics, cognitive
psychology, mathematics (statistics),
informatics, physics and computer
science.
Indexing
• The purpose of storing an index is to optimize speed and
performance in finding relevant documents for a search
query.
• Without an index, the search engine would scan every
document in the corpus, which would require
considerable time and computing power.
• For example, while an index of 10,000 documents can
be queried within milliseconds, a sequential scan of
every word in 10,000 large documents could take hours.
• The additional computer storage required to store the
index, as well as the considerable increase in the time
required for an update to take place, are traded off for
the time saved during information retrieval.
Inverted indexing
• An inverted index is an index data structure
storing a mapping from content, such as words
or numbers, to its locations in a database file, or
in a document or a set of documents, in this
case allowing full text search.
• There are two main variants of inverted indexes:
– A record level inverted index contains a list of
references to documents for each word.
– A word level inverted index additionally contains the
positions of each word within a document.
– The latter form offers more functionality (like phrase
searches), but needs more time and space to be
created.
Example
• Texts T0 = "it is what it is", T1 = "what is it" and
T2 = "it is a banana", have the following inverted
file index (where the integers in the brackets
refer to the subscripts T0, T1 etc.):
– "a": {2}
– "banana": {2}
– "is": {0, 1, 2}
– "it": {0, 1, 2}
– "what": {0, 1}
• A search for the terms "what", "is" and "it" would
give the set {0,1}
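The record level index above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular search engine is built: the function names build_record_index and search_all are made up for this example, and words are split on whitespace only.

```python
from collections import defaultdict

# The three example texts T0, T1 and T2 from the slide.
texts = ["it is what it is", "what is it", "it is a banana"]


def build_record_index(docs):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)


def search_all(index, terms):
    """Return the documents that contain every term (set intersection)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


record_index = build_record_index(texts)
print(sorted(record_index["is"]))                      # [0, 1, 2]
print(search_all(record_index, ["what", "is", "it"]))  # {0, 1}
```

The search result is the intersection of the three posting sets: {0, 1} for "what" with {0, 1, 2} for "is" and {0, 1, 2} for "it".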
Example (cont’d)
• In the full inverted index, where the pairs are document
numbers and local word numbers, "banana": {(2, 3)}
means the word "banana" is in the third document (T2),
and it is the fourth word in that document (position 3):
– "a": {(2, 2)}
– "banana": {(2, 3)}
– "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
– "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
– "what": {(0, 2), (1, 0)}
• A phrase search for "what is it" gets hits for all the words
in both document 0 and document 1, but the terms occur
consecutively only in document 1.
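A word level index and a phrase search over it might look like the sketch below. The names build_word_index and phrase_search are hypothetical, and the phrase test simply checks that each later word sits one position after the previous one in the same document.

```python
from collections import defaultdict

# The three example texts T0, T1 and T2 from the slide.
texts = ["it is what it is", "what is it", "it is a banana"]


def build_word_index(docs):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


def phrase_search(index, phrase):
    """Return the documents in which the words of the phrase are consecutive."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return set()
    following = [set(index[word]) for word in words[1:]]
    hits = set()
    for doc_id, start in index[words[0]]:
        # Word number i + 1 of the phrase must sit at position start + i + 1.
        if all((doc_id, start + i + 1) in postings
               for i, postings in enumerate(following)):
            hits.add(doc_id)
    return hits


word_index = build_word_index(texts)
print(word_index["banana"])                     # [(2, 3)]
print(phrase_search(word_index, "what is it"))  # {1}
```

Document 0 contains all three words, but "what" at position 2 is followed by "it" rather than "is", so only document 1 matches.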
Indexing algorithms
• Semantic
– Stop words
– Stemming
– Synonyms
– Thesauri
– Ontologies
• Syntactic
– Word order
– Word type
– Natural language processing
• Statistical
– Word frequency
– Word proximity
PubMed Related Articles Algorithm (I)
• The neighbors of a document are those documents in
the database that are the most similar to it.
• The similarity between documents is measured by the
words they have in common, with some adjustment for
document lengths.
• To carry out such a program, one must first define what a
word is.
• For us, a word is basically an unbroken string of letters
and numerals with at least one letter of the alphabet in it.
• Words end at hyphens, spaces, new lines, and
punctuation.
• Words on a list of 310 common but uninformative words
(also known as stopwords) are eliminated from processing
at this stage.
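The word definition above can be approximated with a regular expression. This is only a sketch under stated assumptions: the stopword set below is a tiny illustrative subset, not the 310-word list, and lower-casing and the restriction to ASCII letters are choices made for this example rather than details given on the slide.

```python
import re

# Illustrative subset only; the real list of 310 stopwords is not reproduced here.
STOPWORDS = {"a", "and", "in", "is", "of", "the"}


def tokenize(text):
    """Split text into words and drop stopwords.

    A word is an unbroken run of letters and numerals that contains at
    least one letter; hyphens, spaces, new lines and punctuation end it.
    """
    candidates = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in candidates
            if re.search(r"[a-z]", word) and word not in STOPWORDS]


print(tokenize("Interleukin-2 and the p53 gene in 2008"))
# ['interleukin', 'p53', 'gene']
```

Note that "2" and "2008" are dropped because they contain no letter, while "p53" is kept.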
PubMed Related Articles Algorithm (II)
• Next, a limited amount of stemming of words is done.
• Words from the abstract of a document are classified as
text words.
• Words from titles are also classified as text words, but
words from titles are added in a second time to give
them a small advantage in the local weighting scheme.
• MeSH terms are placed in a third category, and a MeSH
term with a subheading qualifier is entered twice, once
without the qualifier and once with it.
• These three categories of words (or phrases in the case
of MeSH) comprise the representation of a document.
• No other fields, such as Author or Journal, enter into the
calculations.
• See http://ii.nlm.gov/MTI/related.shtml for more info.
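The document representation described above could be sketched as follows. The function name represent, the dictionary layout and the "heading/qualifier" string format are assumptions made for illustration; the actual weighting and similarity calculation are not shown on the slide and are not attempted here.

```python
from collections import Counter


def represent(title_words, abstract_words, mesh_headings):
    """Build a simple representation of one document.

    mesh_headings is a list of (heading, qualifier) pairs, where the
    qualifier is None when the MeSH term has no subheading.
    """
    text_words = Counter(abstract_words)
    text_words.update(title_words)  # title words are text words too
    text_words.update(title_words)  # added a second time for a small advantage
    mesh_terms = []
    for heading, qualifier in mesh_headings:
        mesh_terms.append(heading)  # once without the qualifier
        if qualifier:
            mesh_terms.append(f"{heading}/{qualifier}")  # once with it
    return {"text_words": text_words, "mesh_terms": mesh_terms}


doc = represent(
    title_words=["aspirin", "stroke"],
    abstract_words=["aspirin", "reduces", "stroke", "risk"],
    mesh_headings=[("Aspirin", "therapeutic use"), ("Stroke", None)],
)
print(doc["text_words"]["aspirin"])  # 3
print(doc["mesh_terms"])  # ['Aspirin', 'Aspirin/therapeutic use', 'Stroke']
```

Author and journal fields are deliberately absent from the input, matching the statement that no other fields enter into the calculations.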
Ontologies, thesauri and taxonomies
• An ontology is a controlled vocabulary that
describes objects and the relations between
them in a formal way, and has a grammar for
using the vocabulary terms to express
something meaningful within a specified domain
of interest.
• A thesaurus is a controlled list of terms linked
together by semantic, hierarchical, and
associative or equivalence relationships.
• A taxonomy is a set of interdependent concepts
arranged in a lattice based on their relationships.
Semantic inference
[Figure: progression from Keywords, Dictionary, Controlled Vocabulary, Thesaurus and Taxonomy to Ontology, set against the capabilities Discovery, Integration and Prediction]
Semantic levels
[Table comparing Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy and Ontology by the features each provides: Definition, Synonyms, Classification (is_a), Properties (has_a), Other relations]
The Medical Subject Headings
classification
• Controlled vocabulary, thesaurus.
• MeSH terms are arranged in a hierarchy of "MeSH Tree
Structures".
• When PubMed searches a MeSH term, it will
automatically include narrower terms in the search, if
applicable. This is also called "automatic explosion."
• When you click Go, PubMed will look for a match in up to
four lists. It looks first for a match in the MeSH
Translation Table. If it doesn't find a match, it looks in the
Journals Translation Table, then in the Phrase List, and
finally in the Author Index.
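Automatic explosion and the ordered lookup described above can be illustrated with a toy example. Everything in the data below is made up for illustration: the tree is a small hand-written fragment, not the real MeSH Tree Structures, and the table entries do not come from PubMed. The function names explode and translate are hypothetical.

```python
# Toy fragment of a MeSH-like hierarchy: each term maps to its narrower terms.
NARROWER = {
    "Neoplasms": ["Neoplasms by Site"],
    "Neoplasms by Site": ["Breast Neoplasms", "Skin Neoplasms"],
}


def explode(term, narrower):
    """Return the term followed by every narrower term beneath it."""
    terms = [term]
    for child in narrower.get(term, []):
        terms.extend(explode(child, narrower))
    return terms


# The four lists in the order they are consulted; entries are illustrative.
LOOKUP_ORDER = [
    ("MeSH Translation Table", {"breast cancer": "Breast Neoplasms"}),
    ("Journals Translation Table", {"nucleic acids res": "Nucleic Acids Research"}),
    ("Phrase List", {"gene expression profiling": "gene expression profiling"}),
    ("Author Index", {"smith j": "Smith J"}),
]


def translate(query, lookup_order):
    """Return (list name, match) for the first list that contains the query."""
    key = query.lower()
    for list_name, entries in lookup_order:
        if key in entries:
            return list_name, entries[key]
    return None, query


print(explode("Neoplasms", NARROWER))
# ['Neoplasms', 'Neoplasms by Site', 'Breast Neoplasms', 'Skin Neoplasms']
print(translate("Breast cancer", LOOKUP_ORDER))
# ('MeSH Translation Table', 'Breast Neoplasms')
```

Because the lists are tried in order, a query that matches the first list is never looked up in the later ones.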
The Gene Ontology organisation
• The objective of GO is to provide controlled
vocabularies for the description of the molecular
function, biological process and cellular
component of gene products.
• These terms are to be used as attributes of gene
products by collaborating databases, facilitating
uniform queries across them.
• The controlled vocabularies of terms are
structured to allow both attribution and querying
to be at different levels of granularity.
• http://www.geneontology.org
Gene Ontology organisation
• GO collaborators have developed three structured,
controlled vocabularies (ontologies) that describe gene
products in terms of their associated biological
processes, cellular components and molecular functions
in a species-independent manner.
• There are three separate aspects to this effort:
– They write and maintain the ontologies themselves
– They make cross-links between the ontologies and the genes
and gene products in the collaborating databases
– They develop tools that facilitate the creation, maintenance and
use of ontologies.
• Useful links: http://www.amigo.org
Is_a and part_of relationships
[Figure: diagram of terms linked by is_a and part_of relationships (Clark et al., 2005)]
An example of annotation
Mitochondrial P450
(CC24 PR01238; MITP450CC24)
GO cellular component term: mitochondrial inner membrane; GO:0005743
GO molecular function term: monooxygenase activity; GO:0004497
GO biological process term: electron transport; GO:0006118
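An annotation like this one can be queried at a coarser level of granularity by following is_a and part_of links upwards. The sketch below assumes a deliberately simplified, hand-written parent table; it is not the real GO graph, and the function names ancestors and products_for are hypothetical.

```python
# Simplified, illustrative parent links: term -> list of (relation, parent).
PARENTS = {
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [],
}

# Gene product -> the specific terms it is annotated with.
ANNOTATIONS = {"Mitochondrial P450": ["mitochondrial inner membrane"]}


def ancestors(term, parents):
    """Collect every term reachable by following is_a or part_of upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def products_for(term, annotations, parents):
    """Return gene products annotated with the term or any term below it."""
    return sorted(
        product
        for product, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    )


print(sorted(ancestors("mitochondrial inner membrane", PARENTS)))
# ['mitochondrion', 'organelle']
print(products_for("organelle", ANNOTATIONS, PARENTS))
# ['Mitochondrial P450']
```

The gene product is annotated once, at the most specific term, yet a query for the broader term still finds it.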
MicroArray data analysis with GO
[Figure: heat map of clustered gene expression over time in attacked and control samples, all genes (14010), gene tree by Pearson correlation. Clusters are labelled: Defense response; Immune response; Response to stimulus; Toll regulated genes; JAK-STAT regulated genes; Puparial adhesion; Molting cycle; Hemocyanin; Amino acid catabolism; Lipid metabolism; Peptidase activity; Protein catabolism]
Source: Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
GoPubMed
• GoPubMed is a knowledge-based search engine
for biomedical texts. The Gene Ontology (GO)
and Medical Subject Headings (MeSH) serve as a
"table of contents" to structure the
millions of articles in the MEDLINE database.
• GoPubMed is one of the first Web 2.0 search
engines.
• The system was developed at the Technical
University of Dresden by Michael Schroeder and
his team and at Transinsight.
• http://www.gopubmed.org
Medline Cognition
Cognition's Semantic NLP understands:
• Word stems: the roots of words
• Words and phrases, with the individual meanings of ambiguous words and
phrases listed out
• The morphological properties of each word or phrase, e.g. what type of
plural it takes, what type of past tense, and how it combines with affixes
like "re" and "ation"
• How to disambiguate word senses, which allows Cognition's technology to
pick the correct meaning of ambiguous words in context
• The synonym relations between word meanings
• The ontological relations between word meanings; one can think of this as
a hierarchical grouping of meanings or a gigantic "family tree of English"
with mothers, daughters, and cousins
• The syntactic and semantic properties of words. This is particularly
useful with verbs: Cognition encodes the types of objects that different
verb meanings can occur with
iHOP
iHOP (Information Hyperlinked over Proteins) provides the
network of genes and proteins as a natural way of
accessing the millions of abstracts in PubMed.
iHOP
• The minimal information view contains general
information, like the symbol, name and organism of a
gene. Moreover it provides:
– Useful links to external resources (e.g. UniProt, NCBI, OMIM,
etc.)
– Links to other iHOP views on this gene
– Homologues
• Other views contain all sentences found in the literature:
– That mention the main gene of a page together with other genes
(gene B) with which it interacts.
– That mention the main gene together with relevant biomedical
terms such as lymphoma.
• Sentences are ranked by significance, so that scanning
a few sentences is usually sufficient to gain an
idea of a gene's function.
GenMAPP
• GenMAPP is a free computer application
designed to visualize gene expression and other
genomic data on maps representing biological
pathways and groupings of genes.
• Integrated with GenMAPP are programs to
perform a global analysis of gene expression or
genomic data in the context of hundreds of
pathway MAPPs and thousands of Gene
Ontology Terms.
Automatic rendering of pathway interactions
Other ways to search – BLAST,
PubChem, UCSC Genome Browser
By sequence – BLAST:
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGA
TAAGGACGGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCG
GGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGG
GGGGGCG
By structure – PubChem:
Example of BLAST search results
PC Compound Record
UCSC Genome Browser
• The Genome Browser zooms and scrolls over
chromosomes, showing the work of annotators
worldwide.
• The Gene Sorter shows expression, homology and other
information on groups of genes that can be related in
many ways.
• Blat quickly maps your sequence to the genome. The
Table Browser provides convenient access to the
underlying database.
• VisiGene lets you browse through a large collection of in
situ mouse and frog images to examine expression
patterns.
• Genome Graphs allows you to upload and display
genome-wide data sets.
Cross-database search - NCBI
And for the future?
Practical activity 4 - Non-bibliographic
databases
• Total duration - ca. 2 hours.
• If you are a geneticist, biochemist or cell biologist, go to the NCBI
Minicourses page and do one of the courses described there. These
resources are also valuable if you are interested in the molecular
biology of disease.
• If you are a medicinal chemist or a pharmacologist, take a look at
the PubChem resource and find out how you can follow links from a
given compound to related data such as bioactivity studies, literature
abstracts, protein sequences, protein structures, genes and
diseases.
• If you are a clinician, find out more about evidence-based medicine
and apply the PICO approach to building a specific, focused,
answerable question using PubMed.
• If you are none of the above, short-list the database resources
relevant to your field of interest.
• Discuss your findings with the class.