What is Bioinformatics?

Bioinformatics: Definitions,
Challenges and Impact on
Health Care Systems
Joyce Mitchell, Ph.D.
University of Utah
Sept 29, 2005
NLM’s Woods Hole Informatics Course
1
Outline for Talk
1. What is Bioinformatics?
2. Health Informatics compared to Bioinformatics
3. Problems considered in Bioinformatics
• Genomics, proteomics, transcriptomics, etc.
4. Genomics data and patient care
5. Impact of Bioinformatics on Health Information Systems
2
Central Dogma of Molecular Biology
DNA → (Transcription) → RNA → (Translation) → Protein → Phenotype
DNA is also copied to DNA (Replication).
This happens in cells.
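The flow from DNA to RNA to protein can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production tool: the function names `transcribe` and `translate` are made up for this example, the codon table is the standard genetic code, and the input is the first 15 bases of the HFE sequence shown later in the talk.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA (every T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> one-letter amino acids, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


hfe_start = "ATGGGCCCGCGAGCC"  # first 15 bases of the HFE sequence on a later slide
mrna = transcribe(hfe_start)
protein = translate(mrna)
print(mrna, protein)
```

The printed protein letters can be compared with the start of the HFE amino acid sequence shown near the end of the talk.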
3
1. What is Bioinformatics?
Definitions first
4
NIH Working Definition
Bioinformatics: Research, development,
or application of computational tools and
approaches for expanding the use of
biological, medical, behavioral or health
data, including those to acquire, store,
organize, archive, analyze, or visualize
such data.
http://www.bisti.nih.gov/CompuBioDef.pdf
5
Another Definition

An interdisciplinary area at the intersection of
biological, computer, and information
sciences necessary to manage, process, and
understand large amounts of data, for
instance from the sequencing of the human
genome, or from large databases containing
information about plants and animals for use
in discovering and developing new drugs.
www.isye.gatech.edu/~tg/publications/ecology/eolss/node2.html
6
Another definition
NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which
biology, computer science, and information
technology merge into a single discipline.
The ultimate goal of the field is to enable the
discovery of new biological insights and to
create a global perspective from which
unifying principles in biology can be
discerned. There are sub-disciplines in
bioinformatics.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
7
2. Health Informatics
Compared to
Bioinformatics
Same methods, different
application domains
8
Different Areas of Strengths

Bioinformatics has much more data
available on the Internet than Health
Informatics
• Much more progress on database integration
across multiple data sources

Health Informatics has much more need
for aggregation of national statistics
• Much more progress on terminologies for
integration of data
9
Bioinformatics & Health Informatics




Bioinformatics is the study of the flow of
information in biological sciences.
Health Informatics is the study of the flow of
information in patient care.
These two fields are on a collision course as
genomics data becomes used in patient care.
Russ Altman, MD, Ph.D., Stanford Univ.
10
3. Problems Considered in
Bioinformatics
OMES and OMICS
11
Omes and Omics
Genomics
• Primarily sequences (DNA and RNA)
• Databanks and search algorithms
Proteomics
• Sequences (Protein)
• Mass spectrometry, X-ray crystallography
• Databanks, knowledge bases, terminologies
Functional Genomics (transcriptomics)
• Microarray data
• Databanks, analysis tools, traversal techniques
Systems Biology (metabolomics)
• Metabolites and interacting systems (interactomics)
• Graphs, visualization, modeling, networks of entities
12
Central Dogma of Molecular Biology
DNA (Genomics) → RNA (Transcriptomics) → Protein (Proteomics) → Phenotype (Functional Genetics)
13
Genome and Genomics
Genome – entire complement of DNA in a species
• Both nuclear and mitochondrial/chloroplast
• Variants among individuals
Genomics – study of the sequence, structure and function
of the genome. Study of whole sets of genes rather than
single genes.
Comparative genomics – study of the differences among
species. Usually covers evolutionary studies of
differences & conservation over time.
14
A Genome Database (e.g., GenBank)
Consists of long strings of DNA bases – ATCG…..
Consists of “annotations” of this database to attach meaning to the sequence data.
Example entry from GenBank:
• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_000410&dopt=gb
Hemochromatosis gene HFE
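The idea of "sequence plus annotations" can be pictured as a small data structure. The sketch below is only an illustration: the class names, the accession `EXAMPLE_0001`, and the annotation coordinates are hypothetical and do not reproduce the actual GenBank record format. The bases are the first 30 from the HFE sequence shown later in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    label: str   # the meaning attached to this stretch of sequence


@dataclass
class SequenceRecord:
    accession: str
    sequence: str
    annotations: List[Annotation] = field(default_factory=list)

    def region(self, label: str) -> str:
        """Return the bases covered by the first annotation with this label."""
        for annotation in self.annotations:
            if annotation.label == label:
                return self.sequence[annotation.start - 1:annotation.end]
        raise KeyError(label)


record = SequenceRecord(
    accession="EXAMPLE_0001",  # hypothetical identifier, not a real accession
    sequence="ATGGGCCCGCGAGCCAGGCCGGCGCTTCTC",
    annotations=[Annotation(start=1, end=3, label="start codon")],  # illustrative
)
print(record.region("start codon"))
```

Without the annotation list, the record is just a string of letters; the annotations are what let a program ask for a meaningful region by name.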
15
Human Genome Project
Human Genome Project - International research effort
Determine sequence of the human genome and the genomes of other model organisms
Began 1990, completed 2003
Next steps for ~20,000 genes
• Function and regulation of all genes
• Significance of variations between people
• Cures, therapies, “genomic healthcare”
16
“The Human Genome Project
has catalyzed striking paradigm
changes in biology - biology is
an information science.”
Leroy Hood, MD, PhD
Institute for Systems Biology
Seattle, Washington
17
Genomes In Public Databases
                               12/4/01   10/3/02   8/28/03   9/16/05
Published complete genomes:        72       104       156       297
Ongoing prokaryotic genomes:      255       316       386       737
Ongoing eukaryotic genomes:       158       218       246       526
Total:                                                         1560
http://www.genomesonline.org/
18
Genomics activities
Sequence the genes and chromosomes – done by breaking the DNA into parts
Map the location of various gene entities to establish their order
Compare the sequences with other known sequences to determine similarity
• Across species, conserved sequence “motifs”
• Predict secondary structure of proteins
• BLAST and its many forms
Create large databases – GenBank, EMBL, DDBJ
Develop algorithms and similarity measures
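A similarity measure can be as simple as counting matching positions. The sketch below is a deliberately naive illustration and is not how BLAST works (BLAST uses heuristic seeding and allows gaps); the function names are made up for this example. It slides a short query along a target without gaps and reports the best window. The query is the mutated HFE fragment and the target is a stretch of the normal HFE sequence, both taken from later slides.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def best_ungapped_hit(query: str, target: str):
    """Slide the query along the target; return (offset, percent identity) of the best window."""
    best = (-1, -1.0)  # returned unchanged if the target is shorter than the query
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        score = percent_identity(query, window)
        if score > best[1]:
            best = (offset, score)
    return best


target = "GCACAAGATTCGGGCCAGG"   # stretch of the normal HFE sequence
query = "GGACAAGATTCGGG"         # fragment carrying the C-to-G change
offset, identity = best_ungapped_hit(query, target)
print(offset, round(identity, 1))
```

The best window is at offset 0, where 13 of the 14 positions agree; the single mismatch is the mutated base.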
19
Central Dogma of Molecular Biology
DNA (Genomics) → RNA (Transcriptomics) → Protein (Proteomics) → Phenotype (Functional Genetics)
Phenotype: Tissues, Organs, Organisms
20
Proteome and Proteomics


Proteome – the entire set of proteins
(and other gene products) made by the
genome.
Proteomics – study of the interactions
among proteins in the proteome,
including networks of interacting proteins
and metabolic considerations. Also
includes differences in developmental
stages, tissues and organs.
21
Protein Functions
Catalysis
Transport
Nutrition and storage
Contraction and mobility
Structural elements
• Cytoskeleton
• Basement membranes
Defense mechanisms
Regulation
• Genetic
• Hormonal
Buffering capacity
22
Protein Databases
SwissProt
PIR
UniProt http://www.pir.uniprot.org/
GENE http://www.ncbi.nlm.nih.gov/gene
InterPro http://www.ebi.ac.uk/interpro/
Correspond to (and derived from) genome databases
All connected by Reference Sequences (NCBI)
23
Gene/Protein Database entries


HFE record in Entrez GENE (NCBI)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?&db=gene&cmd=retrieve&dopt=Graphics&list_uids=3077
24
Structure & Function Determination





X-ray crystallography
Nuclear magnetic resonance
spectroscopy and tandem MS/MS
Computational modeling
Sequence alignment from others
Homology modeling
25
Structure Databases



Contain experimentally determined and
predicted structures of biological molecules
Most structures determined by X-ray
crystallography, NMR
Example – MMDB (Molecular Modeling Database)
http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
HFE Entry
• http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?form=6&db=t&Dopt=s&uid=9816
26
Protein Interaction Databases



Record observations of protein-protein interactions in cells
Attempts to detail interactions observed in thousands of
small-scale experiments described in published articles
Examples:
• BIND: Biomolecular Interaction Network Database
• DIP: Database of Interacting Proteins
• MIPS: Munich Information Center for Protein Sequences
• PRONET: Protein interaction on the Web
• Many others, both academic and commercial
27
Central Dogma of Molecular Biology
DNA (Genomics) → RNA (Transcriptomics) → Protein (Proteomics) → Phenotype (Functional Genetics)
28
Proteome vs Transcriptome



Functional genomics (transcriptomics)
looks at the timing and regulation of the
gene products (both RNA and proteins)
This is different from looking at what
gene products can be produced – it
looks at the circumstances under which
production occurs.
Involves experimental conditions.
29
Functional Genomics –
Microarrays




Transcriptome and transcriptomics
High throughput technique designed to
measure the increase in RNA (or
sometimes proteins, tissues, etc) in a
cell in response to an experiment.
Also called “gene expression” analysis
Microarrays are called “gene chips” (although
there are now protein and tissue chips)
30
How Do Microarrays Work?
Conceptual description:
• A set of targets (cDNA, proteins, tissues, etc.) is immobilized in predetermined positions on a substrate
• A solution containing tagged molecules capable of binding to the targets is placed over the targets
• Binding occurs between targets and tagged molecules.
• Fluorescent tags allow you to visualize which targets have been bound (and tell you something about the molecules that were present in your solution).
31
Animation of Microarrays

http://www.bio.davidson.edu/courses/genomics/chip/chip.html
32
How Spotted Arrays Work
Result:
• Spots where cDNA from the reference sample hybridized look green
• Spots where cDNA from the experimental sample hybridized look red
• Spots where cDNA from both samples hybridized look yellow (green+red=yellow)
• Spots with little/no cDNA hybridized look black
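The color rules above can be written as a small classifier over the two channel intensities. This is a sketch under stated assumptions: the function name `spot_color`, the background cutoff of 100 intensity units, and the two-fold threshold are illustrative choices made for this example, not values from the talk or from any scanner software.

```python
import math


def spot_color(reference: float, experimental: float,
               background: float = 100.0, fold: float = 2.0) -> str:
    """Classify a spot from its reference (green) and experimental (red) intensities.

    The thresholds are illustrative assumptions, not standard values.
    """
    # Little or no signal in either channel: the spot stays dark.
    if reference < background and experimental < background:
        return "black"
    # Log ratio of the two channels; +1 avoids division by zero.
    log_ratio = math.log2((experimental + 1.0) / (reference + 1.0))
    if log_ratio >= math.log2(fold):
        return "red"       # mostly experimental sample bound
    if log_ratio <= -math.log2(fold):
        return "green"     # mostly reference sample bound
    return "yellow"        # both samples bound in similar amounts


for ref, exp in [(5000, 200), (200, 5000), (3000, 3200), (20, 30)]:
    print(ref, exp, spot_color(ref, exp))
```

Working with the log of the ratio makes a two-fold increase and a two-fold decrease symmetric around zero, which is why expression data are usually reported on a log scale.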
34
Uses of Expression Profiling
Pharmaceutical research:
• ID drug targets by comparing expression profile of drug-treated cells with those of cells containing mutations in genes encoding known drug targets
Disease Dx and Tx:
• Distinguish morphologically similar cancers
  • DLBCL (Poulsen et al. (2005) Microarray-based classification of diffuse large B-cell lymphomas. European Journal of Haematology 74(6):453-65.)
• Therapy potential
  • Rabson AB, Weissmann D. From microarray to bedside: targeting NF-kappaB for therapy of lymphomas. Clin Cancer Res. 2005 Jan 1;11(1):2-6.
37
Future Applications

Diagnostic tool to screen for infective
agents
• Chip imprinted with set of pathogenic
genomes used to identify bacterial, viral, or
parasite genomic material in patient’s body
fluids

Diagnostic chip to check for mutations
involved in drug-gene interactions.
38
Experimental Design (2)

A fundamental challenge of microarray
experiments: underdetermined systems
Kohane IS, Kho AT, Butte AJ. Microarrays for an Integrative Genomics. (The MIT Press; Cambridge, MA; 2003), p. 11.
MGED
Microarray gene expression data
MIAME: “Standards for minimum data to be exchanged”
• minimum information that should be reported about a microarray experiment to enable its unambiguous interpretation and reproduction
• http://www.mged.org/Workgroups/MIAME/miame.html
MAGE: “Standards for format of messages to exchange the data”
• a standard transmission format for microarray experiment data
• http://www.mged.org/Workgroups/MAGE/mage.html
Public Microarray Data Repositories
Major public repositories:
GEO (NCBI)
• http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress (EBI)
• http://www.ebi.ac.uk/arrayexpress/
41
Standards and Repositories


Brazma, A, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics. 2001 Dec;29(4):373.
http://www.nature.com/cgitaf/DynaPage.taf?file=/ng/journal/v29/n4/full/ng1201365.html
Ball, CA, et al. Submission of Microarray Data to Public Repositories. PLoS Biology. 2004 September; 2(9): e317.
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15340489
42
Controlled Vocabularies


Genomics, proteomics, and especially
microarray techniques have created a large
need for controlled vocabularies to assist the
analyses across multiple entities & species.
Taxonomy – systematic classification of objects according to relationships.
Ontologies –
• An organizational framework for concepts
43
Controlled Vocabularies in Bioinformatics
The Gene Ontology http://www.geneontology.org/
• Knowledge capture (the ontology itself)
• Annotation of gene products (for comparisons)
The MGED Ontology (arising from MIAME)
• http://mged.sourceforge.net/
• Annotation of microarray experiments for public repositories
Clinical Bioinformatics Ontology:
• Annotation of gene tests in electronic medical records
• http://www.cerner.com/cbo
MIAPE from Proteomics Standards Initiative (PSI)
• http://psidev.sourceforge.net/
44
4. Genomics Data and
Patient Care
From genotype to phenotype
45
Bioinformatics and Patient Care



Understanding a person’s genome ushers in the era of “Personalized Medicine”
Obviously you should keep track of health-related genetic data in the EMR.
The 9-11 disaster showed you need to know the genomic variant information as well.
• Cash et al. Forensic bioinformatics in the wake of the World Trade Center Disaster. PSB 2003:638-653.
46
Human Disease Gene Specifics
Genes linked to human diseases (9-2004)
+ 425 in 2 yrs
1700/20,000 = 9% of loci
[Bar chart: loci linked to human diseases by year, 2002 to 2004; y-axis 0 to 1800 loci]
47
Genetic Medicine is not new




Karl Landsteiner started genetic
medicine over 100 years ago (1903)
Blood transfusions worked off the ABO
blood group system.
Landsteiner got the Nobel Prize in 1930
for his work.
http://nobelprize.org/medicine/laureates/1930/landsteiner-bio.html
48
Genomic Medicine is New


What to do with all of this genetic
information and every person being
unique?
And the information about genetic
conditions is available on the Internet.
49
Genomics Data and Patient Care


Where do you find the data for genes
causing human diseases?
What do you do with genetic data in
electronic medical records?
50
Where do you find the data for
genes causing human diseases?

Study on availability of genetic data on
health implications of the HGP.
•
Mitchell, McCray, Bodenreider. Methods Inf Med 2003; 42:557-63.
51
Questions








What genes cause the condition?
What is the normal function of the gene?
What mutations have been linked to diseases?
How does the mutation alter gene function?
What laboratories are performing DNA tests?
Are there gene therapies or clinical trials?
What names are used to refer to the genes
and the diseases?
What other conditions are linked to these same
genes?
52
You can find the answers online





… but it is not easy; answers in many places
Can’t navigate by gene names - must use hot links
and numeric identifiers
The number and function of alternate forms of the
protein are inconsistently reported
Synonymy (many names, same meaning) and
polysemy (same name, different meanings) cause
confusion
Upper and lower case are used for species
distinctions
53
Major Challenges of Navigation
Complexity of data
Dynamic nature of the data
Diverse foci and number of data/knowledge base systems
Data and knowledge representation lack standards
Can navigate if you know what you are looking for.

54
Genetics Home Reference


Consumer health resource to help the public
navigate from phenotype to genotype.
Focus on health implications of the Human
Genome Project.

http://ghr.nlm.nih.gov

Mitchell, Fun, McCray, JAMIA, 2004 Nov 11(6):439-447
55
Hands-on with GHR

Scavenger hunt with hemochromatosis
and the genes that influence it.
Explore the Genetics Home Reference by
answering the following questions. Start
at http://ghr.nlm.nih.gov .
56
GHR Scavenger Hunt




How common is hemochromatosis?
How many genes have been proven to
be involved in hemochromatosis when
the genes are mutated?
What are the symbols for these genes?
Can you find the link to MedlinePlus with
health information on hemochromatosis?
57
GHR Scavenger Hunt



What are the names of the patient
support associations for
hemochromatosis?
One synonym for this condition is
“bronze diabetes”. Can you find a
reason for this?
What kind of damage is done to the liver
of people with hemochromatosis?
58
GHR Scavenger Hunt



For the genes involved in hemochromatosis,
how many of them are available as a DNA
test?
Give one place where you would choose to
send a tissue sample for DNA testing.
What sites are listed under “Research Resources” for the TFR2 gene?
• How many alternatively spliced proteins for TFR2?
• In what tissues is this gene expressed?
59
GHR Scavenger Hunt




How do people inherit hemochromatosis?
Do the genes involved in hemochromatosis
cause other health conditions when they are
mutated?
Can you find a protein sequence for one of the
genes?
What clinical trials are available for
hemochromatosis patients close to where you
live?
60
5. Impact of
Bioinformatics on Health
Information Systems
Electronic Medical Record
Public Health Systems
61
Genetics is Impacting Medicine Today!



1700 genes & health conditions
> 1100 gene tests for diagnosis
Relate to diagnosis, therapy, drug
dosage, occupational hazards,
reproductive plans, health risks, ….
62
Well-known Examples



Pharmacogenetics:
• CYP450 alleles: exaggerated, diminished or ultra-rapid drug responses. E.g., warfarin: 93% of patients are OK on standard doses; 7% of patients have severe hemorrhage. CYP2C9*2 and CYP2C9*3 are the most severe of 6 known mutations.
Environmental susceptibility
• Sickle Cell trait carrier and malaria parasite
Nutrition
• PKU and avoidance of phenylalanine
63
Another Example: Iressa (gefitinib)
Non-small cell lung CA ~ 140,000 pt/yr
Iressa (AstraZeneca) causes remission in 1 of 10 patients if taken daily for life.
Iressa efficacy correlates with EGFR mutation in the tumor. Now have gene testing for EGFR so can target appropriate people.
http://www.sciencemag.org/cgi/content/full/305/5688/1222a
BUT – AstraZeneca can’t make money on only 14,000 patients per year (1 in 10 of ~140,000).
http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=131550
64
Collie Dog Example

Collies are more sensitive to the antiparasitic drug ivermectin, to loperamide (Imodium), and to other drugs
75% of collies in the US have a mutation in the mdr1 gene causing multiple drug sensitivity (50 drugs). Can cause death or neurological damage.
Now have testing available.

http://www.wral.com/money/3565592/detail.html


65
Implications for Health Care System

More gene tests will be ordered. [reports of 300% increase in gene tests in 2003.]
• Arch Pathol Lab Med – 2004, 128(12):1330-1333
The FDA will regulate panels of tests.
• http://www.fda.gov/bbs/topics/news/2004/new01149.html
Non-discrimination laws for insurance and employment would open a floodgate.
Preventive healthcare will play a larger part.
Environmental risk factors dictate OSHA-type approach to worker empowerment and education about safe behavior
66
Example: Hemochromatosis

2 copies of mutated HFE gene - too much iron absorbed from diet, which accumulates. Causes arthritis, liver disease, diabetes, skin discoloration.
• (1 million people in US)
HFE gene regulates the storage, transport and absorption of iron
Labs doing gene tests use different techniques: full sequence vs limited analysis
67
A Portion of the HFE DNA Sequence
ATGGGCCCGCGAGCCAGGCCGGCGCTTCTCCTCCTGATGCTTTTGCAGA
CCGCGGTCCTGCAGGGGCGCTTGCTGCGTTCACACTCTCTGCACTA
CCTCTTCATGGGTGCCTCAGAGCAGGACCTTGGTCTTTCCTTGTT
TGAAGCTTTGGGCTACGTGGATGACCAGCTGTTCGTGTTCTATGATCA
TGAGAGTCGCCGTGTGGAGCCCCGAACTCCATGGGTTTCCAGTAGAA
TTTCAAGCCAGATGTGGCTGCAGCTGAGTCAGAGTCTGAAAGGGT
GGGATCACATGTTCACTGTTGACTTCTGGACTATTATGGAAAATCACAA
CCACAGCAAGGAGTCCCACACCCTGCAGGTCATCCTGGGCTGTGAA
ATGCAAGAAGACAACAGTACCGAGGGCTACTGGAAGTACGGGTAT
GATGGGCAGGACCACCTTGAATTCTGCCCTGACACACTGGATTGGAG
AGCAGCAGAACCCAGGGCCTGGCCCACCAAGCTGGAGTGGGAAAG
GCACAAGATTCGGGCCAGGCAGAACAGGGCCTACCTGGAGAGGGAC
TG
68
69
A Portion of the HFE DNA Sequence
GCACAAGATTCGGG
GGACAAGATTCGGG
His: CAU and CAC
Asp: GAU and GAC
A mutation at position 225 changes C to G.
This changes part of the protein (histidine to aspartic acid at position 63).
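The effect of the single-base change can be checked mechanically. The sketch below compares the two 14-base fragments shown on this slide and translates the affected codon. It rests on one assumption, also marked in the code: the changed base is the first base of its codon, which matches the His (CAC) and Asp (GAC) codons listed on the slide. The helper name `point_differences` is made up for this example.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

normal = "GCACAAGATTCGGG"   # fragment as shown on the slide
mutant = "GGACAAGATTCGGG"   # same fragment with the C-to-G change


def point_differences(a: str, b: str):
    """List (index, base_in_a, base_in_b) wherever two equal-length sequences differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


diffs = point_differences(normal, mutant)
index, ref_base, alt_base = diffs[0]

# Assumption: the changed base is the first base of its codon.
ref_codon = normal[index:index + 3]
alt_codon = mutant[index:index + 3]
print(ref_codon, CODON_TABLE[ref_codon], "->", alt_codon, CODON_TABLE[alt_codon])
```

In the one-letter code, H is histidine and D is aspartic acid, so the output reproduces the His-to-Asp change described above.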
70
Amino Acid Sequence for HFE
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQ
DLGLSLFEALGYVDDQLFVFYD[H→D]ESRRVEP
RTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIME
NHNHSKESHTLQVILGCEMQEDNSTEGYWKYGY
DGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRAR
QNRAYLERDCPAQLQQLLELGRGVLDQQVPPLV
KVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDA
KEFEPKDVLPNGDGTYQGWITLAVPPGEEQRY
TCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFI
GILFIILRKRQGSRGAMGHYVLAERE
His63Asp in ONE chromosome
Cys282Tyr in ONE chromosome (not shown)
71
Report Back from Full Sequence Lab

Reference sequences for transcript variant 1 for the HFE gene: NM_000410; NP_000401
Consensus CDS (CCDS): CCDS4578.1
Mutant phenotype changes:
• His63Asp; Cys282Tyr (2 mutations)
Polymorphisms noted:
• AA position 59 VAL53MET [157GA (freq 5%)]
72
Special health concerns HFE


For person with dx:
For family members:
73
Dilemmas





The reference sequence ties you to external
data sources that change
The protein has eleven transcript variants
Mutant phenotype is noted as an amino acid
change
Polymorphisms are noted as nucleotide
change
These results have implications for other family
members in addition to the patient
74
What Should You Store in the EMR?




Do you put the DNA sequence for the gene
into the EMR? Where do you put it?
Do you just store meta-data about the DNA
sequence? HFE test abn or (his63asp;
cys282tyr) What about the normal variants?
If you don’t store the sequence, what do you
do when the reference sequence changes?
How do you trigger alerts and reminders? And
for what? People with hemochromatosis need
special screening and check-ups.
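One possible answer to the metadata question is to store the reference sequence identifiers together with the named variants, and to drive reminders from a rule table. The sketch below is hypothetical throughout: the class `GeneTestResult`, the `FOLLOW_UP_RULES` table, and the rule that any listed variant counts as abnormal are assumptions made for illustration, not a clinical standard or an existing EMR schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneTestResult:
    gene: str
    reference_transcript: str   # ties the result to an external reference sequence
    reference_protein: str
    variants: List[str] = field(default_factory=list)  # amino acid level changes

    @property
    def abnormal(self) -> bool:
        # Simplifying assumption: any recorded variant marks the test as abnormal.
        return bool(self.variants)


# Hypothetical rule table mapping a gene to the follow-up it should trigger.
FOLLOW_UP_RULES = {"HFE": "hemochromatosis screening and check-ups"}


def reminders(result: GeneTestResult) -> List[str]:
    """Return reminder messages for an abnormal result in a gene that has a rule."""
    if result.abnormal and result.gene in FOLLOW_UP_RULES:
        return [f"{result.gene}: schedule {FOLLOW_UP_RULES[result.gene]}"]
    return []


result = GeneTestResult(
    gene="HFE",
    reference_transcript="NM_000410",
    reference_protein="NP_000401",
    variants=["His63Asp", "Cys282Tyr"],
)
print(reminders(result))
```

Storing the reference identifiers next to the variant names keeps the result interpretable, but it does not solve the problem raised above: if the reference sequence changes, the stored positions may need to be re-derived from the original sequence data.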
75
Genetic data in electronic medical records?
Implications for component systems:
• Laboratory
• Pharmacy
• Computerized order entry
• Documentation and notes
Knowledge management
• Alerts and reminders
• Finding patients matching profiles
• Practice guidelines and clinical trials
76
Genome Data and Other
Information Systems



Genomic information will be pervasive in all healthcare information systems.
Also in public health systems
• Newborn screening
• Tissue and organ banks
• DOD requires DNA samples
• Bioterrorism and homeland security
• Identification of World Trade Center victims
Privacy and security issues will remain with us always but are manageable.
77
Summary


Informatics is the enabler of
personalized, genomic medicine.
Personalized medicine requires a
combination of medical informatics and
applied bioinformatics (and a lot more).
78
Informatics will be a very
dynamic discipline for eons
to come!
Your week at Woods Hole
is the first step to an
exciting future.
79
The End
Joyce Mitchell, PhD
University of Utah
80