14Segment2012 - Ganesha Associates

Basic reading, writing and
informatics skills for biomedical
research
Segment 4. Other types of
database and browser
24 August 2012
Ganesha Associates
Biological databases
• A database is an indexed collection of information
• Some databases contain mainly text, but others contain image,
sequence or structural data
• A browser is a means of visualising this information and the
relationships between data elements
• There is a growing amount of information in publicly available
databases.
• For example, in 2011 the Nucleic Acids Research journal online
Molecular Biology Database Collection listed 1380 databases.
• The National Center for Biotechnology Information (NCBI) and the
European Bioinformatics Institute (EBI) host some of the most
important databases used for biomedical research.
• Wikipedia also contains a list of biological databases
• Which databases are relevant to your project?
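One way to explore which NCBI databases hold records relevant to a project is to script a search against NCBI's Entrez E-utilities. The sketch below only builds the request URL for the `esearch` service and does not send it. The helper name `build_esearch_url`, the example query and the choice of databases are illustrative assumptions.

```python
from urllib.parse import urlencode

# Address of the "esearch" service in NCBI's Entrez E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) a search request for one Entrez database.

    build_esearch_url is a hypothetical helper written for this example.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# The same query pointed at three different Entrez databases.
for db in ("pubmed", "gene", "protein"):
    print(build_esearch_url(db, "cytochrome P450"))
```

Opening each printed URL in a browser returns the matching record identifiers, which gives a quick sense of how much a given database holds on a topic.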
Data, data everywhere…
• “Rapid release of prepublication data has served the
field of genomics well.”
• “With close to one million gene-expression data sets now
in publicly accessible repositories, researchers can
identify disease trends without ever having to enter a
laboratory.”
• “Most researchers agree that open access to data is the
scientific ideal, so what is stopping it happening [in other
fields]?”
• “Earth scientists need better incentives, rewards and
mechanisms to achieve free and open data exchange”
The database problem
• Volume of digital data (both high throughput and
text)
– One second of HD video = 2000 pages of text
• Distributed systems and databases, lack of data
standards, incompatible data formats
• Costs of creation, curation and maintenance
• Retrieval: semantic search, metadata, images…
The problem – biomedical research
[Figure: diagram of interlinked databases and data types. Labels: Gene Expression Warehouse, OMIM, Disease, ExPASy SwissProt, PDB, ExPASy Enzyme, Protein, Enzyme, LocusLink, Affy Fragment, Known Gene, MGD, Sequence, Metabolite, SNP, SPAD, Sequence Cluster, NCBI dbSNP, Genbank, NMR, Pathway, UniGene, KEGG]
Cross-database search today - NCBI
The problem – healthcare
Journal of the American Medical Association (JAMA), Vol 284, No 4, 26 July 2000
• 12,000 deaths/year from unnecessary surgery
• 7,000 deaths/year from medication errors in hospitals
• 20,000 deaths/year from other errors in hospitals
• 80,000 deaths/year from infections in hospitals
• 106,000 deaths/year from non-error, adverse effects of medications
These total 225,000 deaths per year in the US from iatrogenic
causes, which ranks them as the third leading cause of death.
Iatrogenic is the term used when a patient dies as a direct result of
treatment by a physician, whether from misdiagnosis of the
ailment or from adverse reactions to drugs used to treat the illness (drug
reactions are the most common cause).
The problem - healthcare
• 17-year innovation adoption curve from discovery to
accepted standards of practice
• Even if a standard is accepted, patients have a 50:50
chance of receiving appropriate care and a 5-10%
probability of incurring a preventable, anticipatable
adverse event
• Medical literature doubling every 19 years
– Doubles every 22 months for AIDS care
• 2 million facts needed to practice
• Genomics and personalized medicine will increase the
problem exponentially
• Typical drug order today with decision support accounts
for, at best, Age, Weight, Height, Labs, Other Active
Meds, Allergies, Diagnoses
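The doubling times quoted above can be converted into annual growth rates, which makes the two figures easier to compare. This is a minimal sketch that assumes steady exponential growth. The function name `annual_growth` is illustrative and not from the source.

```python
def annual_growth(doubling_time_years: float) -> float:
    """Fractional growth per year implied by a doubling time.

    Assumes steady exponential growth: size(t) = size(0) * 2 ** (t / T).
    """
    return 2 ** (1 / doubling_time_years) - 1


# Doubling times taken from the slide: 19 years overall, 22 months for AIDS care.
overall = annual_growth(19)
aids_care = annual_growth(22 / 12)

print(f"Medical literature overall: {overall:.1%} per year")
print(f"AIDS care literature:       {aids_care:.1%} per year")
```

Under this assumption the overall literature grows by just under 4% a year, while the AIDS care literature grows by roughly 46% a year.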
So how will we find things in databases?
• A search engine collects, indexes, parses,
and stores data to facilitate fast and
accurate information retrieval.
• Index design incorporates interdisciplinary
concepts from linguistics, cognitive
psychology, mathematics (statistics),
informatics, physics and computer
science.
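To make the idea of an index concrete, here is a minimal sketch of an inverted index, the basic structure that maps each word to the documents containing it. Real search engines add stemming, ranking and much more. The three sample texts and all function names are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented sample documents, for illustration only.
docs = {
    1: "Mitochondrial P450 has monooxygenase activity",
    2: "Toll regulated genes in the immune response",
    3: "P450 enzymes and the immune response",
}
index = build_index(docs)
print(search(index, "immune response"))
print(search(index, "P450 immune"))
```

Because the index is built once and each lookup is a set intersection, a query does not need to rescan the text of every document.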
Semantic levels
[Table: rows are Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy, Ontology; columns are Definition, Synonyms, Classification (is_a), Properties (has_a), Other relations; cell markings not preserved]
The Gene Ontology organisation
• The objective of GO is to provide controlled
vocabularies for the description of the molecular
function, biological process and cellular
component of gene products.
• These terms are to be used as attributes of gene
products by collaborating databases, facilitating
uniform queries across them.
• The controlled vocabularies of terms are
structured to allow both attribution and querying
to be at different levels of granularity.
• http://www.geneontology.org
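Querying at different levels of granularity can be sketched with a small is_a hierarchy: a gene annotated to a specific term is also found when searching on any broader term above it. The hierarchy below is a simplified toy and not the actual GO graph, and the gene names and annotations are hypothetical.

```python
# Toy is_a hierarchy (child -> parents). Simplified for illustration;
# this is NOT the real Gene Ontology structure.
IS_A = {
    "mitochondrial inner membrane": ["mitochondrial membrane"],
    "mitochondrial membrane": ["organelle membrane"],
    "organelle membrane": ["membrane"],
    "plasma membrane": ["membrane"],
    "membrane": ["cellular component"],
}

# Hypothetical gene products, each annotated to its most specific term.
ANNOTATIONS = {
    "geneA": "mitochondrial inner membrane",
    "geneB": "plasma membrane",
}


def ancestors(term):
    """All broader terms reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_annotated_to(term):
    """Genes annotated to `term` itself or to any more specific term."""
    return {
        gene
        for gene, annotated in ANNOTATIONS.items()
        if annotated == term or term in ancestors(annotated)
    }


print(genes_annotated_to("mitochondrial membrane"))  # fine-grained query
print(genes_annotated_to("membrane"))                # coarse-grained query
```

The curator annotates once, at the most specific level that the evidence supports, and the hierarchy lets users choose how broad a query should be.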
An example of annotation
Mitochondrial P450 (CC24 PR01238; MITP450CC24)
GO cellular component term: mitochondrial inner membrane; GO:0005743
GO molecular function term: monooxygenase activity; GO:0004497
GO biological process term: electron transport; GO:0006118
MicroArray data analysis with GO
[Figure: gene expression over time, attacked vs. control, with genes grouped by GO term: defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, puparial adhesion, molting cycle, hemocyanin, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism]
Bregje Wertheim at the Centre for Evolutionary Genomics,
Department of Biology, UCL and Eugene Schuster Group, EBI.
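A common way to decide which GO terms stand out in a list of differentially expressed genes is an over-representation test. The source does not say which method was used for this figure. The hypergeometric test below is one standard choice, shown as a sketch, and all counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int,
                       selected: int, hits: int) -> float:
    """Upper tail of the hypergeometric distribution.

    Probability of seeing at least `hits` genes carrying a GO term among
    `selected` genes drawn without replacement from `population` genes,
    of which `annotated` carry the term.
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, selected)


# Hypothetical counts: 10,000 genes on the array, 200 annotated to
# "immune response", 100 genes changed after attack, 8 of them annotated.
p = enrichment_p_value(population=10_000, annotated=200, selected=100, hits=8)
print(f"p = {p:.3g}")
```

In practice this test is repeated for every GO term, so the resulting p-values need a correction for multiple testing before they are interpreted.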
GoPubMed
• GoPubMed is a knowledge-based search engine
for biomedical texts. The Gene Ontology (GO)
and Medical Subject Headings (MeSH) serve as
a "table of contents" to structure the
millions of articles in the MEDLINE database.
• GoPubMed is one of the first Web 2.0 search
engines.
• The system was developed at the Technical
University of Dresden by Michael Schroeder and
his team and at Transinsight.
• http://www.gopubmed.org
Medline Cognition
Cognition's Semantic NLP understands:
• Word stems: the roots of words
• Words and phrases, with the individual meanings of ambiguous words and phrases
listed out
• The morphological properties of each word or phrase, e.g. what type of plural
it takes, what type of past tense, and how it combines with affixes like "re"
and "ation"
• How to disambiguate word senses, which allows Cognition's technology to pick
the correct meaning of ambiguous words in context
• The synonym relations between word meanings
• The ontological relations between word meanings; one can think of this as a
hierarchical grouping of meanings or a gigantic "family tree of English" with
mothers, daughters, and cousins
• The syntactic and semantic properties of words. This is particularly useful with
verbs, for example: Cognition encodes the types of objects that different verb
meanings can occur with.
iHOP
Information Hyperlinked over Proteins. iHOP provides the
network of genes and proteins as a natural way of
accessing the millions of abstracts in PubMed
iHOP
• The minimal information view contains general
information, like the symbol, name and organism of a
gene. Moreover it provides:
– Useful links to external resources (e.g. UniProt, NCBI, OMIM,
etc.)
– Links to other iHOP views on this gene
– Homologues
• Other views contain all sentences found in the literature:
– For the main gene of a page and other genes (gene B) with which
it interacts.
– That mention the main gene together with relevant biomedical
terms such as lymphoma.
• Sentences are ranked by significance, so that screening
a few sentences will usually be sufficient to gain an
idea of a gene's function.
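The idea of collecting sentences that mention two genes together can be sketched as follows. This is not iHOP's actual method: it uses a naive sentence splitter and exact symbol matching, and it does not rank sentences by significance. The gene symbols and the sample text are invented placeholders.

```python
import re
from collections import defaultdict
from itertools import combinations

# Placeholder gene symbols, invented for this example.
GENES = {"ABC1", "XYZ2", "LMN3"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def cooccurrences(text):
    """Map each pair of gene symbols to the sentences mentioning both."""
    pairs = defaultdict(list)
    for sentence in split_sentences(text):
        found = sorted(
            gene for gene in GENES
            if re.search(rf"\b{re.escape(gene)}\b", sentence)
        )
        for gene_a, gene_b in combinations(found, 2):
            pairs[(gene_a, gene_b)].append(sentence)
    return pairs


# Invented text standing in for an abstract.
abstract = (
    "XYZ2 binds ABC1 and promotes its degradation. "
    "LMN3 expression is elevated in lymphoma. "
    "Loss of ABC1 alters LMN3 levels."
)
pairs = cooccurrences(abstract)
for (gene_a, gene_b), sentences in sorted(pairs.items()):
    print(gene_a, gene_b, sentences)
```

Treating each gene pair as a link and each sentence as its supporting evidence gives the kind of gene network that can be browsed from one gene page to the next.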
GenMAPP
• GenMAPP is a free computer application
designed to visualize gene expression and other
genomic data on maps representing biological
pathways and groupings of genes.
• Integrated with GenMAPP are programs to
perform a global analysis of gene expression or
genomic data in the context of hundreds of
pathway MAPPs and thousands of Gene
Ontology Terms.
Automatic rendering of pathway interactions
Other ways to search – BLAST,
PubChem, UCSC Genome Browser
By sequence – BLAST:
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGA
TAAGGACGGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCG
GGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGG
GGGGGCG
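The query above is in FASTA format: a header line starting with ">" followed by one or more lines of sequence. Before submitting a sequence to BLAST it can be useful to check it locally. The sketch below parses FASTA text and reports sequence length and GC content. The short record in the example is made up, and the function names are illustrative.

```python
def parse_fasta(text):
    """Return {header: sequence} for every record in FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up record, for illustration only.
example = ">example\nGATTACA\nGGCC\n"
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), round(gc_content(sequence), 3))
```

A check like this catches truncated or wrongly wrapped input, such as a header that promises more nucleotides than the lines below it contain.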
By structure – PubChem:
Example of BLAST search results
PC Compound Record
UCSC Genome Browser
• The Genome Browser zooms and scrolls over
chromosomes, showing the work of annotators
worldwide.
• The Gene Sorter shows expression, homology and other
information on groups of genes that can be related in
many ways.
• Blat quickly maps your sequence to the genome. The
Table Browser provides convenient access to the
underlying database.
• VisiGene lets you browse through a large collection of in
situ mouse and frog images to examine expression
patterns.
• Genome Graphs allows you to upload and display
genome-wide data sets.
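Zooming and scrolling in a genome browser amounts to asking which annotated features overlap the current viewing window. The sketch below shows that overlap test using 0-based, half-open coordinates, the convention of the BED format. The features and the `visible` helper are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def visible(features, chrom, window_start, window_end):
    """Features on `chrom` that overlap the half-open viewing window."""
    return [
        feature for feature in features
        if feature.chrom == chrom
        and feature.start < window_end
        and feature.end > window_start
    ]


# Hypothetical annotations, for illustration only.
features = [
    Feature("chr1", 100, 500, "geneA"),
    Feature("chr1", 450, 900, "geneB"),
    Feature("chr1", 2000, 2500, "geneC"),
    Feature("chr2", 100, 500, "geneD"),
]

# Zoomed-out window, then a zoomed-in window on the same chromosome.
print([f.name for f in visible(features, "chr1", 0, 1000)])
print([f.name for f in visible(features, "chr1", 500, 600)])
```

With half-open coordinates, a feature that ends at position 500 does not overlap a window that starts at 500, which is why only one feature appears in the zoomed-in view.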