The Human Genome: What's In It and How Do We Know?

The Human Genome
What’s in it?
How do we know?
Gary Benson
Department of Computer Science
Department of Biology
Program in Bioinformatics
Boston University
Outline of Talk
• Protein Genes
• SNPs
• Haplotypes
• Finding a Disease Locus
Size of the Genomes
[Figure: bar chart of genome size in millions of basepairs (axis from 0 to 3500) for E. coli (bacteria), S. cerevisiae (yeast), C. elegans (round worm), Drosophila (fruit fly), Arabidopsis (flowering plant), Rice, Maize, and Human.]
The Human Genome
What the letters stand for
DNA has four chemical subunits, called nucleotide bases
abbreviated A, C, G, T.
GATTACA
http://en.wikipedia.org/wiki/Nucleotide
What’s in the Genome?
• Chromosomes – 23 pairs
  – Genes
    • Protein genes
    • RNA genes
    • MicroRNA genes
  – Repeats
    • Tandem repeats
    • Inverted repeats
    • Transposons
    • Segmental duplications
  – Regulatory regions
    • Promoters
    • Transcription factor binding sites
Protein Genes
A protein gene contains the genetic code for a protein. The
production of protein involves transcription (copying DNA to
RNA) and translation (using RNA code to produce a protein).
http://www.slic2.wsu.edu:82/hurlbert/micro101/images/TransTranscrip.gif
Transcription
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/M/Miller_Beatty3.jpg
Translation
http://nobelprize.org/medicine/educational/dna/a/translation/polysome_em.html
Finding Protein Genes
Before the sequencing of genomes, protein genes were found
experimentally. Now, new genes are predicted computationally
using a gene model.
Building a Gene Model
Gene models for prediction are based on the structure of genes in
DNA and their messenger RNAs (mRNAs). This includes exons,
introns, promoters, and the polyadenylation signal.
http://xray.bmc.uu.se/Courses/Bke2/Exercises/Exercise_answers/pre_mRNA_processing.gif
Exons
In this example, EXONS are uppercase and introns are lowercase.
Exons contain the code for a protein; introns interrupt the exons.
Before translation, introns are removed from the messenger
RNA.
DNA:
…ACTGCTACAGtctattgaGAACAACATAGtcacgaacttaacgtgca
GTTTAACAGCACGtctcgaagggca…
RNA (before removal of introns):
…ACUGCUACAGucuauugaGAACAACAUAGucacgaacuuaacg
ugcaGUUUAACAGCACGucucgaagggca…
RNA (after removal of introns):
…ACUGCUACAGGAACAACAUAGGUUUAACAGCACG…
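The transcription and splicing steps in this example can be sketched in Python. This is only an illustration of the slide's convention (uppercase exons, lowercase introns). The names `transcribe` and `splice` are hypothetical, and a real cell recognizes introns by sequence signals, not by letter case.

```python
# Sketch of the slide's example: exons are uppercase, introns are lowercase.
# The function names are illustrative and do not come from any library.

DNA = ("ACTGCTACAGtctattgaGAACAACATAGtcacgaacttaacgtgca"
       "GTTTAACAGCACGtctcgaagggca")


def transcribe(dna: str) -> str:
    """Copy DNA to RNA by writing U in place of T (letter case is preserved)."""
    return dna.replace("T", "U").replace("t", "u")


def splice(pre_mrna: str) -> str:
    """Remove introns, which this example marks with lowercase letters."""
    return "".join(base for base in pre_mrna if base.isupper())


pre_mrna = transcribe(DNA)
mrna = splice(pre_mrna)
print(pre_mrna)
print(mrna)  # ACUGCUACAGGAACAACAUAGGUUUAACAGCACG
```

The printed result matches the "after removal of introns" line of the example.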
Finding Exons
The sequence of an exon contains codons. Each codon is a
triplet of nucleotides which codes for a single amino acid. Amino
acids are the building blocks of a protein.
http://en.wikipedia.org/wiki/Genetic_code
Genetic Code
There are 64 possible codons. Sixty-one of them each specify one of
twenty amino acids. The remaining three are stop codons, which
specify the end of translation.
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/code.gif
Open Reading Frame (ORF)
An open reading frame (ORF) is a sequence of codons that does
not contain a stop codon.
[Figure: an RNA sequence translated codon by codon into alanine, threonine, glutamic acid, leucine, arginine, serine, then STOP.]
http://en.wikipedia.org/wiki/Genetic_code
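A minimal Python sketch of translation with the standard genetic code follows. The table is built from the usual compact listing of the 64 codons in U, C, A, G order. The RNA string passed to `translate` is an assumption: it was chosen so that it yields the amino acids named on the slide (A = alanine, T = threonine, E = glutamic acid, L = leucine, R = arginine, S = serine), and `translate` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons enumerated with the third base varying fastest.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base U
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate codon by codon from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3].upper()]
        if amino_acid == "*":  # "*" marks the three stop codons
            break
        protein.append(amino_acid)
    return "".join(protein)


# Assumed example sequence (not taken from the talk).
print(translate("GCUACGGAGCUUCGGAGCUAG"))  # ATELRS
```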
Finding Exons
Sequence:
acggacucuagccuaaugugacgacugacauagguaaauucgcuc
Even though this sequence contains stop codons, each stop codon falls
in only one reading frame, so every frame still has stretches that are
free of stop codons.
frame +1
acg gac ucu agc cua aug uga cga cug aca uag gua aau ucg cuc
frame +2
a cgg acu cua gcc uaa ugu gac gac uga cau agg uaa auu cgc uc
frame +3
ac gga cuc uag ccu aau gug acg acu gac aua ggu aaa uuc gcu c
Very short ORFs are unlikely to be real exons.
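The three reading frames shown above can be reproduced with a short Python sketch. It splits the slide's sequence into codons for each frame, reports which codons are stops, and measures the longest stop-free run. The helper names are hypothetical, and only the three forward frames are considered.

```python
SEQUENCE = "acggacucuagccuaaugugacgacugacauagguaaauucgcuc"
STOP_CODONS = {"uaa", "uag", "uga"}


def codons_in_frame(rna: str, frame: int) -> list[str]:
    """Return the complete codons of forward frame 1, 2, or 3."""
    start = frame - 1
    return [rna[i:i + 3] for i in range(start, len(rna) - 2, 3)]


def stop_positions(rna: str, frame: int) -> list[int]:
    """Indices (0-based, counted in codons) of the stop codons in a frame."""
    return [i for i, codon in enumerate(codons_in_frame(rna, frame))
            if codon in STOP_CODONS]


def longest_open_run(rna: str, frame: int) -> int:
    """Length, in codons, of the longest stretch without a stop codon."""
    best = current = 0
    for codon in codons_in_frame(rna, frame):
        if codon in STOP_CODONS:
            current = 0
        else:
            current += 1
            best = max(best, current)
    return best


for frame in (1, 2, 3):
    print(frame, stop_positions(SEQUENCE, frame), longest_open_run(SEQUENCE, frame))
# 1 [6, 10] 6
# 2 [4, 8, 11] 4
# 3 [2] 11
```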
Finding Introns
Introns usually start with the dinucleotide GT and end with the
dinucleotide AG (GU and AG in the RNA).
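As a rough illustration of this boundary rule, the sketch below lists every stretch of an RNA sequence that starts with GU and ends with AG. It is a toy: real introns are far longer, and gene finders weigh many more signals than these two dinucleotides. The function name `candidate_introns` is hypothetical, and the sequence is the one used on the neighboring Finding Exons slides.

```python
SEQUENCE = "acggacucuagccuaaugugacgacugacauagguaaauucgcuc"


def candidate_introns(rna: str) -> list[tuple[int, int]]:
    """Return (start, end) slices that begin with 'gu' and end with 'ag'."""
    rna = rna.lower()
    candidates = []
    for start in range(len(rna) - 1):
        if rna[start:start + 2] != "gu":
            continue
        # The closing 'ag' must begin after the opening 'gu' has ended.
        for end in range(start + 2, len(rna) - 1):
            if rna[end:end + 2] == "ag":
                candidates.append((start, end + 2))
    return candidates


for start, end in candidate_introns(SEQUENCE):
    print(start, end, SEQUENCE[start:end])  # 17 33 gugacgacugacauag
```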
Finding Exons
Sequence:
acggacucuagccuaaugugacgacugacauagguaaauucgcuc
A gene can contain open reading frames that are connected across stop
codons by an intron.
frame +1
acg gac ucu agc cua aug uga cga cug aca uag gua aau ucg cuc
frame +3
ac gga cuc uag ccu aau gug acg acu gac aua ggu aaa uuc gcu c
How many genes are there?
Estimates
• pre-2000: 100,000, based on estimates of the number of genes
required to account for human complexity
• 2001: 30,000 – 40,000 based on first draft of human genome
• 2003: 23,000 – 24,500 based on gene prediction computer
programs
Why so low?
• alternative splicing of exons
• complex regulatory mechanisms
• inability to predict genes which are unlike those seen before
http://www.ornl.gov/sci/techresources/Human_Genome/faq/genenumber.shtml
RNA Genes
RNA genes do not code for proteins. Instead, the RNA molecule
itself is functional in the cell.
Examples include:
1. Ribosomal RNA – these molecules form the major
component of the protein building machinery
2. Transfer RNA – works with ribosomal RNA to insert the correct
amino acids into growing proteins
3. MicroRNA – a newly discovered class of RNA which helps
regulate gene expression.
Ribosome
http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/RNA/images/fig_rna12.jpg
RNA Genes
MicroRNAs are short and show little or no conservation of
sequence.
Unlike protein genes, RNA genes do not contain codons or open
reading frames. However, they do contain inverted repeats.
Inverted Repeats (IRs)
Two patterns, one the reverse complement of the other
RNA: GACUUGA … UCAAGUC (the second pattern is the first, reversed and complemented)
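The reverse-complement relationship can be checked with a few lines of Python. This sketch uses only Watson-Crick pairs (A with U, C with G); the helper names are hypothetical.

```python
# Map each RNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Complement every base, then reverse the order."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def is_inverted_repeat(left_arm: str, right_arm: str) -> bool:
    """True when the right arm is the reverse complement of the left arm."""
    return reverse_complement(left_arm) == right_arm.upper()


print(reverse_complement("GACUUGA"))             # UCAAGUC
print(is_inverted_repeat("GACUUGA", "UCAAGUC"))  # True
```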
IR Nomenclature
RNA: GACUUGA (left arm), spacer, UCAAGUC (right arm)
Stem-Loop Structure
[Figure: a stem-loop in which the left arm pairs with the right arm to form the stem and the spacer forms the loop.]
Structure forms by pairing of complementary bases
MicroRNA
MicroRNAs come from a precursor that contains a stem-loop.
http://www.ma.uni-heidelberg.de/apps/zmf/argonaute/interface/mirna.jpeg
Detection of Approximate Inverted Repeats
Human Chr. 3 ~173,291,101
AAGACTTGAA CAACTTTTAA ACATAAGATC AATTATTTCA
AGTAGATTCC CTTTTTCATT CACAATCACA TTCTCACAGA
CACAGTCCCA GTTTCTACCT GACTGAGATG CAGTAAGGAA
TCTGATTATA ACACTCATTG ATTATAACAC TCATTGAATT
TATGGATTCC TTACTGCATC TCATTCAGGT AGAAAAAGGG
ACTGTGTCTG TGAGAATGTG ATTGTGAATG AAAAAGATGG
AATATGTGTA TTTTTGAGTG TCTATGGAAG AGCTTCTGAC
AAGAGAGAGG AAGATTAGGT AAAATGAAAT ATCGCCGTCG
GCATTTCCCC CTACGT
Arms are 72 nt long; the spacer is 42 nt long.
The Problem: Find the Inverted Repeat
Human Chr. 3 ~173,291,101
AAGACTTGAA CAACTTTTAA ACATAAGATC AATTATTTCA
AGTAGATTCC CTTTTTCATT CACAATCACA TTCTCACAGA
CACAGTCCCA GTTTCTACCT GACTGAGATG CAGTAAGGAA
TCTGATTATA ACACTCATTG ATTATAACAC TCATTGAATT
TATGGATTCC TTACTGCATC TCATTCAGGT AGAAAAAGGG
ACTGTGTCTG TGAGAATGTG ATTGTGAATG AAAAAGATGG
AATATGTGTA TTTTTGAGTG TCTATGGAAG AGCTTCTGAC
AAGAGAGAGG AAGATTAGGT AAAATGAAAT ATCGCCGTCG
GCATTTCCCC CTACGT
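To make the search problem concrete, here is a deliberately simplified Python sketch that finds exact inverted repeats in DNA by brute force. It is not the approximate detection method the talk refers to: it tolerates no mismatches or gaps, and it scans every pair of positions. The function name, parameters, and toy sequence are all assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def find_exact_inverted_repeats(seq: str, arm_len: int, max_spacer: int):
    """Return (left_start, right_start, spacer_len) for exact inverted repeats.

    A hit means the right arm equals the reverse complement of the left arm,
    with at most max_spacer bases between the two arms.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for left in range(n - 2 * arm_len + 1):
        target = reverse_complement(seq[left:left + arm_len])
        first_right = left + arm_len
        last_right = min(n - arm_len, first_right + max_spacer)
        for right in range(first_right, last_right + 1):
            if seq[right:right + arm_len] == target:
                hits.append((left, right, right - first_right))
    return hits


# Toy input (hypothetical): a 7-base arm, a 4-base spacer, and the matching arm.
toy = "GACTTGA" + "CCCC" + "TCAAGTC"
print(find_exact_inverted_repeats(toy, arm_len=7, max_spacer=10))  # [(0, 11, 4)]
```

An approximate search would have to score arms that differ by substitutions or indels, which is what makes the problem on real chromosome sequence hard.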
Single Nucleotide Polymorphisms (SNPs)
A SNP is a single position in the genome (a locus) that is not the
same in all people. Some people have one type of nucleotide and
other people have a different nucleotide. Differences in the
population at a single locus are called polymorphisms and the
individual types are called alleles.
[Figure: sequences from several people shown side by side, with the SNP positions marked.]
SNPs are found experimentally
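The definition of a SNP can be illustrated with a small Python sketch that compares the same stretch of sequence in several people and reports the positions where more than one allele occurs. The sequences are hypothetical, and `find_snps` is an illustrative name; real SNP discovery works from experimental data and must also handle sequencing errors.

```python
from collections import Counter


def find_snps(sequences: list[str]) -> dict[int, dict[str, int]]:
    """Map each polymorphic position to the count of every allele seen there."""
    snps = {}
    for position, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        if len(counts) > 1:  # more than one allele means a polymorphism
            snps[position] = dict(counts)
    return snps


# Hypothetical aligned sequences from four people.
people = ["acgta", "acgta", "atgta", "acgtc"]
print(find_snps(people))  # {1: {'c': 3, 't': 1}, 4: {'a': 3, 'c': 1}}
```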
Haplotypes
A haplotype is a collection of SNP alleles on a single
chromosome in an individual.
Shown are SNPs on two chromosomes in each individual.
[Figure: pairs of chromosomes from several individuals, with the SNP allele shown at each locus; the first individual carries the haplotypes acattcat and atagtcca.]
[Same figure, highlighted: homozygous loci (same alleles on both chromosomes)]
[Same figure, highlighted: heterozygous loci (different alleles on the two chromosomes)]
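The homozygous and heterozygous labels can be computed directly from an individual's two haplotypes. The sketch below uses the pair of haplotypes read from the first individual in the figure (acattcat and atagtcca); the function name `zygosity` is hypothetical.

```python
def zygosity(hap1: str, hap2: str) -> list[str]:
    """Label each SNP locus by comparing the alleles on the two chromosomes."""
    return ["homozygous" if a == b else "heterozygous"
            for a, b in zip(hap1, hap2)]


labels = zygosity("acattcat", "atagtcca")
for locus, label in enumerate(labels):
    print(locus, label)
```

With these two haplotypes, loci 0, 2, 4, and 5 are homozygous and loci 1, 3, 6, and 7 are heterozygous.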
[Same figure, highlighted: rare alleles]
[Same figure, highlighted: strong linkage (alleles that usually occur together)]
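One common way to quantify how often two SNP alleles occur together is the r² statistic, computed from haplotype frequencies as r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA·pB. The talk does not say which measure it uses, so treat this Python sketch as one possible illustration. The haplotypes are hypothetical, each SNP is assumed to have exactly two alleles, and `r_squared` is an illustrative name.

```python
def r_squared(haplotypes: list[str], i: int, j: int) -> float:
    """r^2 between SNP i and SNP j, assuming both SNPs have two alleles."""
    n = len(haplotypes)
    allele_a = haplotypes[0][i]  # arbitrary reference allele at SNP i
    allele_b = haplotypes[0][j]  # arbitrary reference allele at SNP j
    p_a = sum(h[i] == allele_a for h in haplotypes) / n
    p_b = sum(h[j] == allele_b for h in haplotypes) / n
    p_ab = sum(h[i] == allele_a and h[j] == allele_b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # departure from independent assortment
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:  # one of the loci is not polymorphic
        return 0.0
    return d * d / denominator


# Hypothetical two-SNP haplotypes.
always_together = ["ac", "ac", "gt", "gt"]   # 'a' always travels with 'c'
independent = ["ac", "at", "gc", "gt"]       # every combination equally common
print(r_squared(always_together, 0, 1))  # 1.0
print(r_squared(independent, 0, 1))      # 0.0
```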
Linkage Analysis
SNPs and haplotypes are used to identify regions of the genome
that cause disease. The technique is called linkage analysis and
evidence of a connection is called linkage disequilibrium (LD).
[Figure: mom carries haplotypes acattcat and atagtcca; dad carries acagatat and tcattcat; the child carries acagtcca and acagacat. Callout: recombination and inheritance.]
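Recombination with a single crossover can be sketched in Python as taking the start of one parental haplotype and the end of the other. The haplotypes below are read from the figure; the crossover positions are an inference chosen because they reproduce the child's chromosomes, and `crossover` is a hypothetical helper.

```python
def crossover(hap1: str, hap2: str, point: int) -> str:
    """Take loci before `point` from hap1 and the remaining loci from hap2."""
    return hap1[:point] + hap2[point:]


MOM = ("acattcat", "atagtcca")
DAD = ("acagatat", "tcattcat")

# Crossover points are assumptions that reproduce the child's haplotypes.
from_mom = crossover(MOM[0], MOM[1], 2)
from_dad = crossover(DAD[0], DAD[1], 5)
print(from_mom)  # acagtcca
print(from_dad)  # acagacat
```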
[Same figure, callout: recombination in the mother’s chromosomes]
[Same figure, callout: recombination in the father’s chromosomes]
[Same figure, callout: two to three crossovers per chromosome per generation]
Linkage Analysis
Key point: Alleles that are physically close together tend to be
inherited together because the chance of a crossover between
them is small. They exhibit strong linkage.
[Figure: the same mom, child, and dad haplotypes as on the preceding Linkage Analysis slides.]
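The key point can be made quantitative with a toy model. Assume, purely for illustration, that exactly one crossover occurs and that it is equally likely to fall in any gap between neighboring loci. Under that assumption the chance that two loci are separated is proportional to the number of gaps between them. The function name is hypothetical.

```python
from fractions import Fraction


def separation_probability(n_loci: int, i: int, j: int) -> Fraction:
    """Chance that one uniformly placed crossover separates locus i from locus j.

    Gap g (for g = 1 .. n_loci - 1) lies between locus g - 1 and locus g.
    """
    low, high = min(i, j), max(i, j)
    separating_gaps = sum(1 for g in range(1, n_loci) if low < g <= high)
    return Fraction(separating_gaps, n_loci - 1)


print(separation_probability(8, 0, 1))  # 1/7: neighbors are rarely separated
print(separation_probability(8, 3, 5))  # 2/7
print(separation_probability(8, 0, 7))  # 1: the two ends are always separated
```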
Finding an Unknown Disease Locus
The genomic location of the cause of many diseases is unknown. SNPs
and haplotypes are being used to search for disease loci using
linkage analysis.
[Figure: the same mom, child, and dad haplotypes; dad has the disease and the child has the disease.]
Linkage Analysis – Dominant Model
Assume the disease is caused by a dominant allele, meaning one
copy is enough to cause the disease.
[Same figure (dad and child have the disease), highlighted: SNP alleles in father that are not in mother]
[Same figure, highlighted: SNP allele in child, inherited from father with disease]
[Same figure, highlighted: SNP allele and disease are linked, indicating a possible disease locus.]
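The dominant-model reasoning on these slides can be sketched in Python: look for SNP alleles that the affected father carries, the unaffected mother lacks, and the affected child has inherited. The haplotypes are read from the figure, the function name is hypothetical, and a single family like this only suggests a locus; it does not prove one.

```python
MOM = ("acattcat", "atagtcca")     # unaffected
DAD = ("acagatat", "tcattcat")     # has the disease
CHILD = ("acagtcca", "acagacat")   # has the disease


def dominant_candidates(affected_parent, unaffected_parent, affected_child):
    """Loci where the child carries an allele found only in the affected parent."""
    hits = []
    for locus in range(len(affected_parent[0])):
        parent_only = ({h[locus] for h in affected_parent}
                       - {h[locus] for h in unaffected_parent})
        child_alleles = {h[locus] for h in affected_child}
        if parent_only & child_alleles:
            hits.append(locus)
    return hits


print(dominant_candidates(DAD, MOM, CHILD))  # [4]
```

Locus 4 (counting from 0) is the only position where the child carries an allele, a, that is present in the father and absent from the mother.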
Linkage Analysis – Recessive Model
Assume the disease is caused by a recessive allele, meaning two
copies are required to cause the disease.
[Same figure (dad and child have the disease), highlighted: homozygous SNP alleles in father that are heterozygous in mother]
[Same figure, highlighted: homozygous SNP allele in child, identical to father’s]
[Same figure, highlighted: SNP allele and disease are linked, indicating a possible disease locus.]
BMI = weight/height² in kg/m²; BMI > 25 is overweight, BMI > 30 is obese
Other Differences – Microdeletions
A microdeletion is the loss of a small piece of DNA, perhaps as
small as 1000 bases. These pieces can contain genes, parts of
genes or regulatory regions.
[Figure: a pair of chromosomes with the SNP allele shown at each locus; microdeletions remove stretches of one or both chromosomes.]
[Same figure, callout: heterozygous]
[Same figure, callout: homozygous]
[Same figure, callout: miscalled homozygous. A locus deleted on one chromosome shows only the remaining allele, so it is called homozygous.]
Apparent Inheritance Inconsistency
SNPs and haplotypes are used to identify regions of the genome
that cause disease. The technique is called linkage analysis and
evidence of a connection is called linkage disequilibrium (LD).
[Figure: SNP genotypes of mom, child, and dad at a series of loci.]
[Same figure, callout: aa + tt → at in the child by Mendelian inheritance]
[Same figure, callout: a cluster of inconsistencies suggests a microdeletion.]
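The consistency check behind this idea can be sketched in Python. A child's genotype at a locus is consistent with Mendelian inheritance when one of its alleles can come from the mother and the other from the father. The genotypes below are hypothetical, written as two-letter strings, and the function names are illustrative.

```python
def mendelian_consistent(mom: str, dad: str, child: str) -> bool:
    """True if one child allele can come from mom and the other from dad."""
    first, second = child
    return (first in mom and second in dad) or (second in mom and first in dad)


def inconsistent_loci(mom, dad, child) -> list[int]:
    """Indices of loci where the called genotypes break Mendelian inheritance."""
    return [locus for locus, trio in enumerate(zip(mom, dad, child))
            if not mendelian_consistent(*trio)]


# Hypothetical genotype calls at four neighboring loci.
mom_calls = ["aa", "cc", "aa", "gg"]
dad_calls = ["tt", "cc", "cc", "gt"]
child_calls = ["aa", "cc", "aa", "gt"]   # "aa" at loci 0 and 2 cannot include dad's allele

print(mendelian_consistent("aa", "tt", "at"))                # True
print(inconsistent_loci(mom_calls, dad_calls, child_calls))  # [0, 2]
```

In this made-up example the inconsistent loci sit close together, which is the pattern the slide describes: if the child's paternal copy of the region were deleted, only the maternal allele would be observed at each locus.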
Microdeletions
Hundreds of microdeletion haplotypes have been discovered
recently. They may be a major contributor to human differences
and disease.
Resources
UCSC Human Genome Browser
http://genome.ucsc.edu/cgi-bin/hgGateway
National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov/
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed