Introductory Speaker, Jonathan Pevsner: "Genomics, Bioinformatics

Genomics, Bioinformatics
and the Revolution in Biology
Jonathan Pevsner, Ph.D.
Kennedy Krieger Institute/
Johns Hopkins School of Medicine
Outline
Three views of bioinformatics and genomics
Informatics
From small to large
From genotype to phenotype
The chromosomes
SNPs, HapMap, and the 1000 Genomes project
Definitions of bioinformatics and genomics
• Bioinformatics is the interface of biology and computers.
It is the analysis of proteins, genes and genomes
using computer algorithms and databases.
• Genomics is the analysis of genomes, including the
nature of genetic elements on chromosomes.
The tools of bioinformatics are used to make
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.
• Genetics is the study of the origin and expression of
individual uniqueness.
Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype
bioinformatics
medical informatics
genomics
Tool-users
public health informatics
Tool-makers
algorithms
databases
infrastructure
Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype
DNA
RNA
protein
phenotype
Rapid growth of DNA sequences
Figure: total number of DNA base pairs in GenBank/WGS by year, 1982 to 2008 (base pairs in billions, sequences in millions).
Time of
development
Body region, physiology,
pharmacology, pathology
The Origin of Species (1859)
It is interesting to contemplate a tangled bank, clothed with
many plants of many kinds, with birds singing on the
bushes, with various insects flitting about, and with worms
crawling through the damp earth, and to reflect that these
elaborately constructed forms, so different from each other,
and dependent upon each other in so complex a manner,
have all been produced by laws acting around us.
Source: Origin of Species, Chapter 15
Eukaryotes (Baldauf et al. 2000)
fungi
animals
slime mold
plants
Paramecium
Plasmodium
Trypanosoma
Giardia
Trichomonas
Wolfe et al. (1999)
8 chromosomes (5,000 genes)
16 chromosomes (10,000 genes)
16 chromosomes (6,000 genes)
Paramecium tetraurelia: a ciliate with two nuclei, 40,000 genes, and
three whole-genome duplications
Phylogenetic footprinting
Phylogenetic shadowing
Population shadowing
Three views of bioinformatics and genomics
1. The field of informatics
2. From small to large
3. From genotype to phenotype
DNA
RNA
protein
pathway
cell
organism
population
We see 500 inpatients and 13,000
outpatients per year at the Kennedy
Krieger Institute. Why do children engage
in self-injurious behavior? In many cases,
there are chromosomal insults.
From genotype…
DNA
RNA
protein
pathway
cell
organism
population
…to phenotype
DNA
RNA
protein
pathway
cell: cellular phenotype
organism: clinical phenotype
population
Central dogma of molecular biology:
DNA is transcribed into RNA,
and translated into protein.
Central dogma of bioinformatics/genomics:
the genome is transcribed into the transcriptome,
and translated into the proteome.
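As a minimal sketch of the two steps named above, the following Python transcribes a short coding-strand DNA sequence into RNA and translates it into protein. The example sequence and the deliberately partial codon table are illustrative assumptions, not material from the talk.

```python
# Toy illustration of transcription and translation.
# CODON_TABLE is intentionally partial: it holds only the codons
# used by the made-up example sequence below.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "GGU": "G",  # glycine
    "UAA": "*",  # stop
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon.

    Codons missing from the partial table are reported as 'X'.
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTGGTTAA"        # hypothetical 12-base coding sequence
rna = transcribe(dna)
protein = translate(rna)
print(rna, protein)
```

A real analysis would use a complete genetic code table, for example from an established library, rather than this four-codon subset.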
Over 200 billion base pairs of DNA have now
been sequenced, from >165,000 organisms.
Scope of bioinformatics
Sequence analysis
Pairwise alignment
Multiple sequence alignment
Phylogeny
Database searching (e.g. BLAST)
Functional genomics
RNA studies; gene expression profiling
Proteomics; protein structure
Gene function
Pairwise alignments in the 1950s
β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu
Oxytocin                  CYIQNCPLG
Vasopressin               CYFQNCPRG
globins: α-, β-, myoglobin
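The oxytocin and vasopressin sequences above are the same length, so they can be compared position by position without gaps. The sketch below counts identical residues; it is a simple illustration, not the method used in the 1950s papers, and the function name is hypothetical.

```python
def count_identities(seq_a: str, seq_b: str):
    """Return (matches, mismatch_positions) for two equal-length sequences.

    Positions are 1-based, as is conventional for protein residues.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this gap-free comparison needs equal-length sequences")
    mismatches = [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    return len(seq_a) - len(mismatches), mismatches


oxytocin = "CYIQNCPLG"
vasopressin = "CYFQNCPRG"

matches, mismatches = count_identities(oxytocin, vasopressin)
print(f"{matches}/{len(oxytocin)} identical; differences at positions {mismatches}")
```

Sequences of unequal length, such as the globins, need an alignment algorithm that can introduce gaps.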
Early example of sequence
alignment: globins (1961)
H.C. Watson and J.C.
Kendrew, “Comparison
Between the Amino-Acid
Sequences of Sperm Whale
Myoglobin and of Human
Hæmoglobin.” Nature
190:670-672, 1961.
LAGAN
2e Fig. 5.21
Multiple sequence alignment of five globins:
ClustalW
Praline
MUSCLE
Probcons
TCoffee
Four bases: A, G, C, T arranged in
base pairs along a double helix (1953).
Human genome project: sequencing all
~3 billion base pairs (2003).
1995: first genome sequence (a bacterium)
2000: fruit fly genome, plant
2003: human genome
2008: two individual human genomes finished
      1,000 human genomes (launched)
      SNPs used to study chromosomes
Time of development
Body region, physiology, pharmacology, pathology
Genotype
DNA
RNA
protein
pathway
cell
organism
population
Phenotype
Outline
Three views of bioinformatics and genomics
Informatics
From small to large
From genotype to phenotype
The chromosomes
SNPs, HapMap, and the 1000 Genomes project
Eukaryotic genomes are organized
into chromosomes
Genomic DNA is organized in chromosomes. The diploid
number of chromosomes is constant in each species
(e.g. 46 in human). Chromosomes are distinguished by a
centromere and telomeres.
The chromosomes are routinely visualized by karyotyping
(imaging the chromosomes during metaphase, when
each chromosome is a pair of sister chromatids).
Fig. 16.19
Page 565
human chromosome 21 at NCBI (labels: nucleolar organizing center, centromere)
human chromosome 21 at www.ensembl.org (labels: nucleolar organizing center, centromere)
human chromosome 21 at UCSC Genome Browser (label: centromere)
First P.G. mitosis in polar
view. Tradescantia
virginiana,
Commelinaceae, n = 9
(from aberrant plant with
22 chromosomes). 2 BE CV smears. x 1200.
Printed on multigrade
paper.
Darlington.
First P.G. mitosis in Paris
quadrifolia, Liliaceae,
showing all stages from
prophase to telophase. n
= 10 (cf. Darlington 1937,
1941)
2 BE – CV smear, 8mm.
objective. x 800
Darlington.
Root tip squashes
showing anaphase
separation. Fritillaria
pudica, 3x = 39,
spiral structure of
chromatids revealed
by pressure after cold
treatment.
2 BD – Feulgen; x
3000
Darlington.
Cleavage mitosis in the
morula of the teleostean
fish, Coregonus clupeoides,
in the middle of anaphase.
Spindle structure revealed
by slow fixation. Section cut
at 10 µ. x 4000. Strong
Flemming, haematoxylin.
Prep. and photo by P.C.
Koller.
Darlington.
The eukaryotic chromosome: Robertsonian fusion
creates one metacentric by fusion of two acrocentrics
ordinary male house mouse (Mus musculus, 2n = 40)
male tobacco mouse (Mus poschiavinus, 2n = 26)
Ohno (1970) Plate II
The spectrum of variation
Category of variation         Size             Type
Single base pair changes      1 bp             SNPs, point mutations
Small insertions/deletions    1 – 50 bp
Short tandem repeats          1 – 500 bp       microsatellites
Fine-scale structural var.    50 bp – 5 kb     del, dup, inv, tandem repeats
Retroelement insertions       0.3 – 10 kb      SINEs, LINEs, LTRs, ERVs
Intermediate-scale struct.    5 kb – 50 kb     del, dup, inv, tandem repeats
Large-scale structural var.   50 kb – 5 Mb     del, dup, inv, large tandem repeats
Chromosomal variation         >>5 Mb           aneuploidy
Adapted from Sharp AJ et al. (2006) Annu Rev Genomics Hum Genet 7:407-42
Across the genome, there
are four possible SNP calls:
[1] homozygous (AA)
[2] homozygous (BB)
[3] heterozygous (AB)
[4] no call
In a deleted region, there
are three possible SNP calls:
[1] A (interpreted as AA)
[2] B (interpreted as BB)
[3] no call
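Because a deleted region yields no heterozygous (AB) calls, one rough way to flag candidate deletions is to look for long runs of SNPs without any AB call. The sketch below assumes calls are given in chromosomal order as the strings "AA", "AB", "BB", with "NC" standing for no call; the function name, the "NC" label, and the `min_len` threshold are all hypothetical choices for illustration.

```python
def runs_without_het(calls, min_len=5):
    """Return (start, end) index pairs, end exclusive, for runs that contain
    at least min_len homozygous calls and no heterozygous call.

    Calls other than "AA", "AB", "BB" (for example "NC") neither extend
    nor end a run.
    """
    runs = []
    start = None  # index of the first homozygous call in the current run
    count = 0     # homozygous calls seen in the current run
    for i, call in enumerate(calls):
        if call == "AB":
            if start is not None and count >= min_len:
                runs.append((start, i))
            start, count = None, 0
        elif call in ("AA", "BB"):
            if start is None:
                start = i
            count += 1
    if start is not None and count >= min_len:
        runs.append((start, len(calls)))
    return runs


calls = ["AB", "AA", "AB", "BB", "AA", "NC", "AA", "BB", "AA", "AB", "BB"]
print(runs_without_het(calls, min_len=5))
```

A run without heterozygous calls is only suggestive: stretches of homozygosity also occur without any deletion, so signal intensity or another copy number measure would be needed to confirm a loss.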
Single nucleotide polymorphisms (SNPs) to investigate
chromosomes: A case of 7p deletion
Figure labels: genotype calls AA, AB, BB; calls A, B in the deleted region.
• Deletions (and duplications) such as these are
called copy number variants (CNVs).
• CNVs commonly occur in normal individuals.
• When found in individuals with disease, we
can tell if they are inherited (likely to be
benign) or occur de novo (more likely to be
disease-associated) by comparison to the
parents’ genotypes.
• Recent papers report many CNVs in disease.
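The comparison to the parents described above can be sketched as an interval-overlap check. This is a simplified illustration under stated assumptions: CNVs are represented as hypothetical (chromosome, start, end) tuples with half-open coordinates, any overlap with a parental CNV counts as inheritance, and the coordinates below are invented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_cnv(child_cnv, father_cnvs, mother_cnvs):
    """Label a child's CNV 'inherited' if it overlaps any parental CNV,
    otherwise 'de novo'."""
    parental = list(father_cnvs) + list(mother_cnvs)
    if any(overlaps(child_cnv, p) for p in parental):
        return "inherited"
    return "de novo"


# Invented coordinates, for illustration only.
father = [("chr7", 400, 900)]
mother = []

print(classify_cnv(("chr7", 100, 500), father, mother))
print(classify_cnv(("chr10", 1, 50), father, mother))
```

A more careful comparison would also require the same CNV type (deletion versus duplication) and a minimum reciprocal overlap before calling a variant inherited.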
A case of trisomy 21 (Down syndrome)
AAA
AAB
ABB
BBB
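The genotype classes seen at each copy number follow from counting unordered combinations of the two alleles. A short sketch, with a hypothetical function name:

```python
from itertools import combinations_with_replacement


def possible_genotypes(copy_number, alleles="AB"):
    """List the unordered allele combinations for a given copy number."""
    return ["".join(g) for g in combinations_with_replacement(alleles, copy_number)]


for copies, label in [(1, "deletion"), (2, "normal"), (3, "trisomy")]:
    print(copies, label, possible_genotypes(copies))
```

One copy gives A and B, two copies give AA, AB, BB, and three copies give the four classes listed for trisomy 21.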
Three cases of 10q deletion
Deafness gene?
The International HapMap Project
► A catalog of common genetic variants that occur in humans
► The project’s goal is to compare the genetic sequences of
different individuals to identify chromosomal regions where
genetic variants are shared
► An initial focus has been on four groups (n=270):
CEU: European ancestry (30 trios), Utah residents
YRI: African ancestry (30 trios), Yoruba in Ibadan, Nigeria
JPT/CHB: Asian ancestry (90 individuals), Japanese in Tokyo, Japan
and Han Chinese in Beijing, China
► Phase I (2005): > 1 million SNPs
Phase II (2007): added 2.1 million SNPs
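The sample count of n=270 can be checked from the group sizes, taking a trio as three individuals (father, mother, child). A small arithmetic sketch:

```python
TRIO_SIZE = 3  # father, mother, child

samples = {
    "CEU": 30 * TRIO_SIZE,  # 30 trios
    "YRI": 30 * TRIO_SIZE,  # 30 trios
    "JPT/CHB": 90,          # 90 individuals
}

total = sum(samples.values())
print(samples, total)
```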
The International HapMap Project
► In addition to CEU, YRI, and JPT/CHB additional
populations have been genotyped including:
Maasai in Kinyawa, Kenya
Luhya in Webuye, Kenya
Gujarati Indians in Houston, TX
Toscani in Italy
Mexican ancestry in Los Angeles
African ancestry in southwestern US
The ENCODE project
► The ENCyclopedia Of DNA Elements (ENCODE) project
was launched in 2003
► Pilot phase: devise and test high-throughput approaches
to identify functional elements. Efforts center on 44 DNA
targets. These cover about 1 percent of the human genome,
or about 30 million base pairs.
► Second phase: technology development.
► Third phase: production. Expand the ENCODE project to
analyze the remaining 99 percent of the human genome.
The ENCODE project
Goal of ENCODE: build a list of all sequence-based functional
elements in human DNA. This includes:
► protein-coding genes
► non-protein-coding genes
► regulatory elements involved in the control of gene
transcription
► DNA sequences that mediate chromosomal structure and
dynamics.
ENCODE data at the UCSC Genome Browser: beta globin
(50,000 base pairs including HBB, HBD, HBG1, HBG2, HBE1)
ENCODE tracks available at the UCSC Genome Browser
EGASP: the human ENCODE Genome
Annotation Assessment Project
EGASP goals:
[1] Assess the accuracy of computational methods to
predict protein coding genes. 18 groups competed to make
gene predictions, blind; these were evaluated relative to
reference annotations generated by the GENCODE project.
[2] Assess the completeness of the current human
genome annotations as represented in the ENCODE
regions.
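One common way to score gene predictions against a reference annotation is at the nucleotide level: what fraction of reference coding bases were predicted (sensitivity), and what fraction of predicted bases are in the reference (precision, which gene-prediction papers often call specificity). The sketch below is an illustration of that idea, not EGASP's actual evaluation protocol; the interval coordinates are invented and the function names are hypothetical.

```python
def covered_positions(intervals):
    """Return the set of positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(predicted, reference):
    """Return (sensitivity, precision) of predicted coding bases
    relative to reference coding bases."""
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    precision = true_positives / len(pred) if pred else 0.0
    return sensitivity, precision


# Invented exon coordinates, for illustration only.
reference = [(100, 200), (300, 400)]
predicted = [(150, 200), (300, 450)]

print(nucleotide_accuracy(predicted, reference))
```

Building explicit position sets is fine for a 50 kb region like the globin locus; larger regions would call for interval arithmetic instead.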
UCSC: tracks for Gencode and for various gene prediction algorithms
(focus on 50 kb encompassing five globin genes)
Gencode
JIGSAW
On bioinformatics
“Science is about building causal relations between natural
phenomena (for instance, between a mutation in a gene and
a disease). The development of instruments to increase our
capacity to observe natural phenomena has, therefore,
played a crucial role in the development of science - the
microscope being the paradigmatic example in biology. With
the human genome, the natural world takes an
unprecedented turn: it is better described as a sequence of
symbols. Besides high-throughput machines such as
sequencers and DNA chip readers, the computer and the
associated software becomes the instrument to observe it,
and the discipline of bioinformatics flourishes.”
On bioinformatics
“However, as the separation between us (the observers) and
the phenomena observed increases (from organism to cell
to genome, for instance), instruments may capture
phenomena only indirectly, through the footprints they leave.
Instruments therefore need to be calibrated: the distance
between the reality and the observation (through the
instrument) needs to be accounted for. This issue of
Genome Biology is about calibrating instruments to observe
gene sequences; more specifically, computer programs to
identify human genes in the sequence of the human
genome.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1,
introducing EGASP, the Encyclopedia of DNA Elements (ENCODE)
Genome Annotation Assessment Project
The 1000 Genomes Project
Goal: To create a deep catalog of human genetic
variation in multiple populations.
[1] Discover variants (SNPs, copy number variants,
insertions/deletions). Include ~all variants with allele
frequencies >1% across the genome (and >0.1-0.5%
in gene regions)
[2] Estimate the frequencies of variant alleles
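Goal [2] can be illustrated with the simplest estimator: count the copies of an allele among called diploid genotypes and divide by the total number of chromosomes. The genotype strings, the "NC" no-call label, and the sample data below are assumptions for illustration, not project data.

```python
def allele_frequency(genotypes, allele="B"):
    """Estimate the frequency of an allele from diploid genotype calls.

    Only "AA", "AB" and "BB" are counted; anything else (for example
    "NC" for no call) is skipped.
    """
    called = [g for g in genotypes if g in ("AA", "AB", "BB")]
    if not called:
        raise ValueError("no called genotypes")
    copies = sum(g.count(allele) for g in called)
    return copies / (2 * len(called))


# Ten called individuals plus one no call: 2 B alleles out of 20 chromosomes.
genotypes = ["AA", "AB", "AA", "AA", "AB", "AA", "AA", "AA", "AA", "AA", "NC"]
print(allele_frequency(genotypes))
```

With only a handful of individuals a 1% variant would usually be missed entirely, which is one reason the project samples many individuals.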
The 1000 Genomes Project
Secondary goals:
• Characterize SNPs
• Improve the human reference sequence
• Study regions under selection
• Study variation across populations
• Study mutation and recombination
The 1000 Genomes Project
Current approaches include sequencing two HapMap
trios (one from YRI, one CEU; father/mother/child) at
20X depth using next generation sequencing
technology.
For one individual, 20X depth = 60 gigabases
For one trio, 20X depth = 180 gigabases
In another approach, sequence many individuals
(n=1000) from the extended HapMap collection at
lighter coverage.
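The gigabase figures above follow from depth multiplied by genome size, using the ~3 billion base pair human genome size quoted earlier in the talk. A small sketch, with a hypothetical function name:

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion base pairs (approximate)


def bases_required(depth, n_individuals=1, genome_size=GENOME_SIZE_BP):
    """Total bases to sequence for n individuals at a given average depth."""
    return depth * n_individuals * genome_size


one_individual = bases_required(20)
one_trio = bases_required(20, n_individuals=3)
print(one_individual / 1e9, "gigabases;", one_trio / 1e9, "gigabases")
```

The same function shows why lighter coverage is attractive for the 1000-individual design: the total scales linearly with both depth and sample count.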
Conclusions
We briefly surveyed the fields of bioinformatics and
genomics. Bioinformatics serves biology, and
genomics depends on the tools of bioinformatics.
There are rapid advances in available technologies,
such as next generation sequencing, that allow us to
address fundamental biological questions at
unprecedented resolution. These questions include
the nature of variation within and between genomes of
individuals, groups (gender, ethnicity, disease status),
and across species. Other questions, posed decades
ago, concern biological processes such as
development, metabolism, adaptation, and function.