The Impact of the Human Genome Project on Clinical Diagnostics

Introduction to Genomics and Proteomics Historical Perspective and the Future
Eleftherios P. Diamandis, M.D., Ph.D., FRCPC (C)
UNIVERSITY OF TORONTO
(Course 1505S/Jan. 9, 2001 #1)
Organization of the Lecture
Historical Background
The Human Genome Project
Critical Technologies:
• Massive, automated sequencing
• DNA and RNA analysis
• Mass spectrometry
• DNA and protein microarrays
• Bioinformatics
• Single nucleotide polymorphisms
Applications:
• Diagnostics
• Therapeutics
• Pharmacogenetics
Ethics
Patents
(Course 1505S/Jan. 9, 2001 #2)
Historical Milestones
Year       Milestone
1866       Mendel’s discovery of genes
1871       Discovery of nucleic acids
1951       First protein sequence (insulin)
1953       Double helix structure of DNA
1960s      Elucidation of the genetic code
1977       Advent of DNA sequencing
1975-79    First cloning of human genes
1986       Fully automated DNA sequencing
1995       First whole genome (Haemophilus influenzae)
1999       First human chromosome (Chr #22)
2000       Drosophila / Arabidopsis genomes
2001       Human and mouse genomes
(Course 1505S/Jan. 9, 2001 #3)
Terminology
DNA → Genomics
mRNA → Transcriptomics
Protein → Proteomics
Metabolites → Metabolomics
Functional genomics, proteomics ----- etc.
(Course 1505S/Jan. 9, 2001 #4)
History
On June 26, 2000, at the White House, it was announced
that the Human Genome Project was essentially completed by:
• Celera Genomics (private company)
• The National Human Genome Research Institute
and its international partners (publicly funded)
The work has yet to be published, but Celera scientists
submitted a paper to “Science” on December 6, 2000.
(Course 1505S/Jan. 9, 2001 #5)
Diagnostics / Prognostics
• Does my DNA predispose me to a specific disease?
• Do I want to know? (Ethics)
• Genetic mutations → disease → cancer
                              → diabetes
                              → Alzheimer’s
                              → heart disease
• Whole genome scans for identification of mutations/polymorphisms?
AACC2000-#2 - 1
Pharmacogenetics and Pharmacogenomics
Goal is to associate human sequence
polymorphisms with:
• Drug metabolism
• Adverse effects
• Therapeutic efficacy

* Decrease drug development cost
* Optimize selection of clinical trial participants
* Increase patient benefit
AACC2000-#2 - 3
Critical Protein Technologies
Protein:
• Make pure form (recombinant)
• Activity
• Reagents (antibodies)
• Identification (sequencing)
• Identify post-translational modification (glycation, phosphorylation, etc.)
• Protein-protein interactions (physiological function)
• Gene → protein knockout / transgene
AACC2000 -#2 - 13
Models of Human Disease
• Identify natural human knockouts
• Develop mice with every gene (or gene combination) being knocked out (this project is now underway!)
AACC2000 -#2 -14
Expressed Sequence Tags (ESTs)
• Cloned cDNAs from various tissues (cDNA libraries)
• Can search through by BLAST analysis
• Can purchase them, fully sequence and characterize them
Great help for new gene identification.
AACC2000 -#2 -16
Gene Patents
• Gene fragments
• Whole genes without function
• Whole genes with function
• Whole genes with function and utility (enablement)
AACC2000 -#2 - 18
Where Do We Stand Today? (July 2000)
Public Consortium: 85% of Genome is done
* 24% finished form
* 22% near finished
* 38% draft
* rest is being done
Celera:
Claims to have more than 99% of
genome now!
Incyte:
They may have all the genes!
AACC2000 -#2 -25
Where Does the Individual Researcher Stand?
• At the end of the day, each gene must be looked at in great detail:
- structure
- function
- physiology/pathways
- pathophysiology
- connection to disease
- tools
• Individual researchers can make the big discoveries on a very specific gene or a very specific gene family
• Great time for individual researchers
AACC2000 -#2 - 20
The Future of Genome Projects
• Human
• Mouse (just started)
• Rat
• Zebrafish
• Dog
• Other primates
* The Era of Comparative Genomics (you can learn a lot about humans by studying yeast, Drosophila, mouse, etc.)
AACC2000 -#2 - 21
The Impact of the Human Genome Project in
Medicine
• You can’t make a car if you are missing parts
• Once all genes are known, we will start understanding their function → PATHWAYS
• We will then be able to correlate disease states to certain genes (Pathobiology)
DISEASE → GENE(S)
GENE(S) → DISEASE
• We will then find ways for rational treatments (designer drugs), prevention, diagnosis...
AACC2000 -#2 - 22
Gene Manipulation (Ethics??)
Gene modulation ( regulation)
Gene repair
Gene excision
Gene replacement/transplantation
Gene improvement
AACC -#2 -23
Celera’s Whole Genome Shotgun Strategy
• Does not use BAC clones; cuts whole DNA into millions of pieces which are sequenced
• Computer assembles pieces together
• Achieves high accuracy with 6x coverage
• Lots of relatively short gaps
AACC -#2 - 26
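The link between coverage and remaining gaps on the slide above can be illustrated with a small calculation. This is only a sketch under the idealized Poisson (Lander-Waterman) model, which assumes reads land uniformly at random; the 500 bp read length is an assumed value for illustration and does not come from the lecture.

```python
import math

def expected_covered_fraction(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read (idealized Poisson model)."""
    return 1.0 - math.exp(-coverage)

def reads_needed(genome_size: int, read_length: int, coverage: float) -> int:
    """Number of reads required to reach a given average coverage."""
    return math.ceil(coverage * genome_size / read_length)

genome_size = 3_000_000_000  # 3 x 10^9 bp, the figure quoted in the lecture
read_length = 500            # assumed read length, for illustration only

for c in (1, 4, 6, 8):
    covered = expected_covered_fraction(c)
    n_reads = reads_needed(genome_size, read_length, c)
    print(f"{c}x coverage: {covered:.4%} of bases covered, {n_reads:,} reads")
```

Under these assumptions, 6x coverage leaves only a fraction of a percent of bases unsequenced, but spread over a genome of this size that still amounts to many short gaps.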
Strategy to Sequence Human Genome
Construct a human genomic library in an appropriate vector (BAC)
Assemble overlapping BAC clones in order to obtain full coverage of the distance (restriction map)
[Figure: DNA covered by overlapping BAC clones]
Start sequencing each BAC until you finish the job
AACC -#2 - 27
How are these BACs Sequenced?
Shotgun Sequencing
BAC clone is broken down to small pieces which have
overlapping ends
Small pieces are sequenced and a computer assembles the
pieces based on the overlapping sequence information
Construct contigs (contiguous areas of sequence)
Larger contigs
AACC -#2 -28
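The assembly step described on this slide (overlapping pieces merged by computer into contigs) can be sketched with a toy greedy assembler. This is a minimal illustration, not the software used by either sequencing effort: it assumes error-free reads from one strand, ignores repeats, and uses made-up example reads.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap; return the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no overlaps left: what remains are separate contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

# Hypothetical reads, for illustration only
reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap stay as separate contigs, which is how gaps appear in a draft assembly.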
Other Important Genomic Technologies
• Recombinant DNA (cloning)
• PCR
• Pulsed Field Gel Electrophoresis (PFGE)
• Chromosome microdissection
• Somatic hybrid cell lines (mapping) [rodent x human]
• Radiation hybrid cell lines [rodent x human]
• DNA sequencing
AACC2000- #2- 32
Annotation
What is annotation?
Make sense out of a linear sequence → identify genes, intron/exon boundaries, regulatory sequences, predict protein structure, identify motifs, predict function, etc.
Annotation will likely go on for a few years.
Major annotation tool → BIOINFORMATICS (hardware & software)
Celera Genomics
• The publicly funded project started around 1990 with a goal to produce a highly accurate sequence by 2005
• Celera started in 1998 and within 2 years sequenced more DNA than the publicly funded consortium!
Why?
• No bureaucracy
• Facility (300 sequencers x 24h/day)
• Powerful supercomputer
• Lots of money
• More efficient sequencing approach (no BACs necessary)
• Use of data from the publicly funded project
Cloning Vectors
• Replicable units of DNA which can carry exogenously inserted DNA; size of insert varies with vector type:
Vector                                              Insert size
plasmid                                             5-10 kb
λ phage                                             20 kb
cosmid                                              45 kb
PAC/BAC (P1- or bacterial artificial chromosome)    100-200 kb
YAC (yeast artificial chromosome)                   1,000 kb
AACC2000- #2- 31
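Insert size determines how many clones a genomic library needs. A rough sketch, assuming random and equally sized inserts, uses the standard library-size estimate N = ln(1 - P) / ln(1 - f), where f is insert size divided by genome size and P is the desired probability that any given sequence is represented. The single insert sizes below are picked from the ranges in the table above; the 99% target is an assumption.

```python
import math

def clones_needed(genome_size: float, insert_size: float, probability: float = 0.99) -> int:
    """Clones needed so that a given locus is in the library with the stated probability."""
    f = insert_size / genome_size  # fraction of the genome carried by one clone
    return math.ceil(math.log(1.0 - probability) / math.log1p(-f))

GENOME_SIZE = 3e9  # base pairs, as quoted in the lecture

# Representative insert sizes (bp) chosen from the ranges in the table; assumptions
vectors = {
    "plasmid": 10_000,
    "lambda phage": 20_000,
    "cosmid": 45_000,
    "BAC": 150_000,
    "YAC": 1_000_000,
}

for name, size in vectors.items():
    print(f"{name:>12}: about {clones_needed(GENOME_SIZE, size):,} clones")
```

The estimate falls roughly in proportion to insert size, which is one reason large-insert vectors such as BACs are attractive for mapping a whole genome.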
Human Genome
• 3 x 10^9 base pairs
• Approximately 100,000 genes
• < 10% of DNA encodes for genes; the rest represents introns/repetitive elements
• Importance of non-coding sequences currently not understood
AACC2000 -#2 -33
Quality of Sequencing
• Clones are sequenced multiple times to verify the sequence:
x 4 → rough draft → 1 error per 100 bases
x 8-11 → finished draft → 1 error per 10,000 bases
AACC2000 -#2 -34
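The error rates on this slide can also be expressed on the Phred quality scale, Q = -10 log10(p), a common convention in sequencing that the lecture itself does not mention. The sketch below simply converts between the two representations.

```python
import math

def phred_quality(p_error: float) -> float:
    """Phred-scaled quality for a per-base error probability."""
    return -10.0 * math.log10(p_error)

def error_probability(quality: float) -> float:
    """Per-base error probability for a Phred-scaled quality."""
    return 10.0 ** (-quality / 10.0)

# Error rates quoted on the slide
for label, p in (("rough draft", 1 / 100), ("finished draft", 1 / 10_000)):
    print(f"{label}: 1 error per {round(1 / p):,} bases, Q = {phred_quality(p):.0f}")
```

On this scale, the rough draft corresponds to about Q20 and the finished draft to about Q40.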
The Next Race
• It will not be who has the sequence
• It will be how you can use the sequence to arrive at products
* DIAGNOSTICS
* THERAPEUTICS
AACC2000 -#2- 35
Genomics and Drug Discovery
Genomic technologies are involved in all aspects of the drug discovery process, from target validation through to the marketed drug, which include:
• Molecular target identification
• Drug target characterization and validation
• Lead discovery
• Lead optimization
• Clinical candidate to marketed drug
AACC2000- #2- 37
Key Corporate Players in Proteomics
______________________________________________________________
Company                        Location               Approach
______________________________________________________________
Celera                         Rockville, MD          Databases
Incyte Pharmaceuticals         Palo Alto, CA          Databases
GeneBio                        Geneva, Switzerland    Databases
Proteome Inc.                  Beverly, MA            Databases
PE Biosystems                  Framingham, MA         Instrumentation
Ciphergen Biosystems           Palo Alto, CA          Protein arrays
Oxford GlycoSciences           Oxford, UK             2D gel/MS*
Protana                        Odense, Denmark        2D gel/MS
Genomic Solutions              Ann Arbor, MI          2D gel/MS
Large Scale Proteomics Corp.   Rockville, MD          2D gel/MS
______________________________________________________________
* 2D gel electrophoresis and mass spectrometry
AACC2000- #2-381
Pharmacogenetics and Pharmacogenomics in Drug Discovery
_______________________________________________________
Aspect of Drug Development    Approach
_______________________________________________________
Drug-drug interactions        Examine polymorphism in metabolic enzymes
Efficacy                      Differentiate responders from nonresponders
Side Effects                  Examine variation in gene or genes involved in mediating the effects (may be mechanism related or unrelated)
Toxicity                      Gene expression profiling in cells treated with compound. Look for toxicity signatures.
_______________________________________________________
AACC2000- #2- 39
The Biography of the Year 2000
(Francis Collins and J. Craig Venter)
Creating an Array of Contiguous BAC Clones
The ...omics
Predicting the Future
What is going to happen now that the human and other genomes are completed?
How quickly will the next steps happen?
What are the potential difficulties?
Are we expecting too much?
(Course 1505S - Jan. 15/01 - #6)
Grand Plan
Find all the genes
Translate genes to proteins
“Compute” function by similarity search and
comparison to known proteins
“Compute” structure
(Course 1505S - Jan. 15/01 - #7)
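The “compute function by similarity search” step in the plan above can be sketched as ranking database entries by a local alignment score against a query. This is a toy Smith-Waterman scorer with an assumed simple scoring scheme (match +2, mismatch -1, gap -2), not a substitute for BLAST or a substitution-matrix search; the database names and sequences are hypothetical.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def rank_hits(query: str, database: dict[str, str]) -> list[tuple[str, int]]:
    """Score the query against every entry and sort from best to worst."""
    scored = [(name, local_alignment_score(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical database entries, for illustration only
database = {"entry_1": "MKTAYIAKQR", "entry_2": "GGSAYIAKPL", "entry_3": "WWWWWWWW"}
for name, score in rank_hits("AYIAKQ", database):
    print(name, score)
```

A high score only shows that two sequences are similar; whether the best hit shares the query's function is a separate question.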
Difficulties
• Gene prediction programs are unreliable
• Function inference by just similarity search may be fallacious
• Computation of structure is still unreliable
Our databases may get contaminated with “wrong” information.
(Course 1505S - Jan. 15/01 - #8)
Gene Prediction
• Programs were designed based on knowledge of already cloned genes (ORFs; splice sites; start/stop codons, etc.)
• These programs provide excellent clues for gene presence but they never or rarely predict the complete gene structure
• The computer prediction must be taken as a “starting point” to experimentally clone a gene
How many genes in the genome?
Estimate: 27,462 to 312,278!
(Course 1505S - Jan. 15/01 - #9)
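One of the signals mentioned on this slide, open reading frames bounded by start and stop codons, can be sketched as follows. This is a deliberately naive illustration: it scans only the forward strand, knows nothing about introns or splice sites, and expects uppercase A/C/G/T input. That naivety is also why such output can only be a starting point. The example sequence is made up.

```python
import re

# Standard genetic code, built from the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate DNA codon by codon; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, str]]:
    """Return (start index, protein) for ATG...stop ORFs in the three forward frames."""
    orfs = []
    for frame in range(3):
        protein = translate(dna[frame:])
        for m in re.finditer(r"M[^*]*\*", protein):
            if len(m.group()) - 1 >= min_codons:
                orfs.append((frame + 3 * m.start(), m.group()[:-1]))
    return sorted(orfs)

# Hypothetical sequence, for illustration only
print(find_orfs("CCATGGCTAAATGACC"))
```

In a genome where most genes are interrupted by introns, an ORF scan like this finds at best individual exons.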
What is a Gene?
• Heritable unit corresponding to a phenotype?
• DNA that encodes for a protein?
• DNA that encodes RNA?
• What if RNA is not translated?
• What if a “gene” is not expressed?
(Course 1505S - Jan. 15/01 - #10)
Prediction of Function
What is function? This is not a simple term.
Function may be:
• a biological process (e.g. serine protease activity)
• a molecular event (e.g. proteolysis of a specific substrate)
• a cellular structure (e.g. membrane; chromatin; mitochondrion; etc.)
• relevance to a whole process (e.g. cell cycle)
• relevance to the whole organism (e.g. ovulation)
* Some scientists have now initiated projects to “compute” function of whole organisms.
(Course 1505S - Jan. 15/01 - #11)
Pattern Recognition
• Looks for motifs that may have functional relevance (family signatures):
* Membrane anchoring
* Catalytic site
* Nucleotide binding
* Nuclear localization signal
* Hormone response element
* Calcium binding, etc.
• Protein family resources (being created now)
(Course 1505S - Jan. 15/01 - # 12)
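Motif searching of the kind listed above can be sketched with regular expressions. The example uses one well-known nucleotide-binding signature, the P-loop (Walker A) pattern [AG]-x(4)-G-K-[ST], written as a regex; the choice of motif and the short test sequence are illustrative assumptions, not taken from the lecture.

```python
import re

# Motif name -> compiled pattern; the P-loop pattern [AG]-x(4)-G-K-[ST] as a regex
MOTIFS = {
    "P-loop": re.compile(r"[AG].{4}GK[ST]"),
}

def scan_motifs(protein: str) -> list[tuple[str, int, str]]:
    """Return (motif name, 0-based start, matched text) for every motif hit."""
    hits = []
    for name, pattern in MOTIFS.items():
        for m in pattern.finditer(protein):
            hits.append((name, m.start(), m.group()))
    return hits

# Short illustrative sequence containing a P-loop
print(scan_motifs("MTEYKLVVVGAGGVGKSALTIQ"))
```

A match is only a hint of function: short patterns also occur by chance, so hits need experimental confirmation.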
Homology
• What is “homology”?
Definition: Two proteins are homologous if they are related by divergence from a common ancestor.
[Figure: divergent evolution; labels: Ancestor; A, B, C, D; Homologous]
(Course 1505S - Jan. 15/01 - # 13)
Analogy
• What is “analogy”?
Definition: Two proteins are “analogous” if they acquired common structural and functional features via convergent evolution from unrelated ancestors.
[Figure: convergent evolution; labels: A, B, C, D; Unrelated; Analogous (similar structure and/or function)]
(Course 1505S - Jan. 15/01 - # 14)
Serine Proteases (Convergent Evolution)
Trypsin-like: many homologous members
Subtilisin-like: many homologous members
The two families are analogous proteins.
Trypsin and subtilisin share groups of catalytic residues with almost identical spatial geometries but they have no other sequence or structural similarities.
(Course 1505S - Jan. 15/01 - # 15)
Human Kallikrein Gene Family
(Divergent Evolution)
15 homologous genes on human chromosome
19q13.4
Divergence in tissue expression and substrate
specificity
(Course 1505S - Jan. 15/01 - # 16)
Orthologs
Proteins that usually perform the same function in different species (e.g. DNA polymerase; glucose-6-phosphate dehydrogenase; retinoblastoma gene; p53, etc.).
Paralogs
Proteins that perform different but related functions
within one organism [usually formed by gene
duplication and divergent evolution] (e.g. the 15
kallikrein genes mentioned above).
(Course 1505S - Jan. 15/01 - # 17)
Functional Annotation - Difficulties
• Who knows if the best matches in a database query are really orthologs or paralogs?
• Modules: Building blocks of proteins. Finding a “module” in a protein does not mean that a “function” can be assigned since these modules do not always perform the same function
Aphorism: The properties of a system can be explained by, but not deduced from, those of its components
(Course 1505S - Jan. 15/01 - # 18)
Structure Prediction
• How proteins fold in 3D space
• We still cannot reliably “compute” structures of > 100 amino acid proteins (ab initio methods)
• Experiment and computation: Crystallography, NMR
(Course 1505S - Jan. 15/01 - # 19)
Future
• Lots of rigorous work needs to be done
• Holistic view:
-- regulation of gene expression
-- metabolic pathways
-- signaling cascades
Remember: Proteins do not work in isolation but within integrated networks.
(Course 1505S - Jan. 15/01 - # 20)
The Importance of Accurate Functional
Annotation
• Function in whole organisms is complex and interrelated
• Need for close collaboration between:
- software developers
- annotators
- experimentalists
• Holistic approaches needed for optimal knowledge-based inference and innovation (drugs, diagnostics, etc.)
(Course 1505S - Jan. 15/01 - # 21)
How Protein Structure is Elucidated
Protein Annotation
PLANT GENOMES
Species                Scientific name            Genome Size (base pairs)
--------------------------------------------------------------------------
Brassicas
Thale cress            Arabidopsis thaliana       1.0 x 10^8
Oilseed rape/canola    Brassica napus             1.2 x 10^9
--------------------------------------------------------------------------
Cereals
Rice                   Oryza sativa               4.2 x 10^8
Barley                 Hordeum vulgare            4.8 x 10^9
Wheat                  Triticum aestivum          1.6 x 10^10
Maize/corn             Zea mays                   2.5 x 10^9
--------------------------------------------------------------------------
Legumes
Garden pea             Pisum sativum              4.1 x 10^9
Soya bean              Glycine max                1.1 x 10^9
--------------------------------------------------------------------------
Solanaceae
Potato                 Solanum tuberosum          1.8 x 10^9
Tomato                 Lycopersicon esculentum    1.0 x 10^9
--------------------------------------------------------------------------
Human                  Homo sapiens               3.2 x 10^9
The New Human Kallikrein Gene Locus (19q13.4, 300 kb)
Revised November 29, 2000
[Figure: map of the locus from the centromere, with gene lengths and intergenic distances in kb. Genes shown, with earlier names in parentheses: KLK1, KLK15, KLK3 (PSA), KLK2, KLK4 (KLK-L1), KLK5 (KLK-L2), KLK6 (Zyme), KLK7 (HSCCE), KLK8 (Neuropsin), KLK9 (KLK-L3), KLK10 (NES1), KLK11 (TLSP), KLK12 (KLK-L5), KLK13 (KLK-L4), KLK14 (KLK-L6). Other genes labelled: ACPT, SIGLEC-9.]