Chase

Genome organization & its genetic implications
Lander, ES (2011) Initial impact of the sequencing of the
human genome. Nature 470:187
Feuillet, C, JE Leach, J Rogers, PS Schnable, K Eversole (2011)
Crop genome sequencing: lessons and rationales. Trends Plant
Sci 16:77
DNA sequencing technologies
                              Read length     Speed         Cost / human genome
First gen (Sanger)            800 bases       0.1 Gb/day    $70,000,000
Next gen (454/Illumina/APG)   30-300 bases    1-5 Gb/day    $75,000-$250,000
Metzker, M (2010) Sequencing technologies – the next
generation. Nature Rev Genet 11:31
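To put the throughput figures in perspective, the following is a rough back-of-the-envelope sketch. It assumes a haploid human genome of about 3 Gb (a value not given on the slide) and counts only raw sequence output, ignoring library preparation and assembly. The function name `days_for_coverage` is illustrative.

```python
# Assumption: approximate haploid human genome size in gigabases.
HUMAN_GENOME_GB = 3.0

def days_for_coverage(throughput_gb_per_day, coverage=1.0, genome_gb=HUMAN_GENOME_GB):
    """Days of instrument output needed to reach `coverage`-fold raw sequence."""
    return coverage * genome_gb / throughput_gb_per_day

# Throughput values taken from the table above.
first_gen_days = days_for_coverage(0.1)      # first generation, 0.1 Gb/day
next_gen_days_slow = days_for_coverage(1.0)  # next generation, lower bound
next_gen_days_fast = days_for_coverage(5.0)  # next generation, upper bound

print(first_gen_days, next_gen_days_slow, next_gen_days_fast)
```

Real projects need many-fold coverage, so the `coverage` argument scales all three numbers by the same factor.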
What are the challenges for the correct
assembly of genome sequence
information?
• Genome size
Eukaryotic genomes ~ 10^9 – 10^10 bp
• Genome composition
Eukaryotic genomes ~ 50 % repetitive DNA
Genome size – the C-value paradox
[Figure: genome size in base pairs]
Genome Size – the C value paradox:
The amount of DNA in the haploid cell of an
organism is not related to its evolutionary
complexity or number of genes
Genome composition
• Complexity = length in nucleotides of longest nonrepeating sequence that can be formed by
splicing together all unique sequence in a sample
• Eukaryotic genomes contain different classes of
DNA based on sequence complexity:
highly repetitive
middle repetitive
unique
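The difference between genome size and sequence complexity can be illustrated with a small calculation. This is a minimal sketch using a hypothetical genome: the three components, their lengths and their copy numbers are invented for illustration and do not describe any real organism.

```python
# Hypothetical genome: (class, length of one copy in bp, copies per haploid genome)
components = [
    ("unique", 1_000_000, 1),
    ("middle repetitive", 5_000, 100),
    ("highly repetitive", 200, 10_000),
]

def genome_size(parts):
    """Total DNA: every copy of every sequence is counted."""
    return sum(length * copies for _, length, copies in parts)

def complexity(parts):
    """Complexity: each distinct sequence is counted once, whatever its copy number."""
    return sum(length for _, length, _ in parts)

print(genome_size(components))  # total bp in the hypothetical genome
print(complexity(components))   # bp of non-repeating sequence
```

In this example most of the DNA is repetitive, so the complexity is much smaller than the genome size.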
Genome composition – DNA reassociation kinetics
[Figure: DNA reassociation curves; axes: complexity and C0t in (moles of nucleotide / liter) x sec]
Genome composition – DNA reassociation kinetics for a complex eukaryotic genome
[Figure: reassociation curve with highly repetitive, middle repetitive and single-copy sequence components; x-axis: C0t in (moles of nucleotide / liter) x sec]
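The shape of a reassociation curve follows from ideal second-order kinetics, where the fraction of DNA still single-stranded is 1 / (1 + k x C0t). The sketch below assumes that ideal form. The component fractions and rate constants are hypothetical values chosen only to show three well-separated transitions.

```python
def fraction_single_stranded(cot, k):
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * cot)

def cot_half(k):
    """C0t value at which half of the DNA has reassociated."""
    return 1.0 / k

def mixed_curve(cot, components):
    """Sum over components given as (fraction_of_genome, rate_constant)."""
    return sum(f / (1.0 + k * cot) for f, k in components)

# Hypothetical genome: repeats reassociate fast (large k), single copy slowly.
genome = [
    (0.25, 1e3),   # highly repetitive
    (0.25, 1.0),   # middle repetitive
    (0.50, 1e-3),  # single copy
]

for cot in (1e-4, 1e-2, 1.0, 1e2, 1e4):
    print(cot, round(mixed_curve(cot, genome), 3))
```

Because the rate constant falls as complexity rises, more complex sequence classes reach their half-reassociation point at higher C0t.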
From genome composition to genome
organization
How are unique, middle repetitive and highly
repetitive sequences organized in the genome?
Genome organization
[Figure: arrangement of genes and repeats in E. coli, S. cerevisiae, H. sapiens and Z. mays; legend: Gene, Repeat; labels: gene desert, gene island]
Genetic complexity
• Eukaryotic genomes contain ~ 20,000 – 30,000
genes
• 30% of protein coding genes are members of gene
families
duplication & divergence of sequence & gene
function
Gene complexity
• What does a gene look like from a sequence or
transcript perspective?
no “typical gene”
• Introns and exons
introns can be numerous and long, i.e. some genes are
more intron than exon!
alternative splicing variants are common
• Not all genes encode proteins
non-coding structural RNAs (e.g. rRNA, tRNA, snRNA,
snoRNA)
non-coding regulatory RNAs (e.g. miRNA, lncRNA)
Implications of gene and genetic
complexity
• Forward genetics: Have mutant – want gene
• Via map-based cloning:
Map your mutation
Look at the genome sequence in the map interval to
identify candidate genes
• Candidate gene identification may not be trivial,
even with good genome annotation!
Especially an issue for plant genome sequences – only
Arabidopsis and rice are considered “finished” quality
• Note further genetic tests required, even if the
perfect candidate is identified.
Gene identification - open reading frames
5'-atgcccaagctgaatagcgtagaggggttttcatcatga-3'
frame 1
atg ccc aag ctg aat agc gta gag ggg ttt tca tca tga
 M   P   K   L   N   S   V   E   G   F   S   S   *
frame 2
tgc cca agc tga ata gcg tag agg ggt ttt cat cat ga
 C   P   S   *   I   A   *   R   G   F   H   H
How can real ORFs be distinguished from ORFs that arise by chance?
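The frame-by-frame translation above can be reproduced with a short script. This is a minimal sketch that assumes the standard genetic code and forward-strand frames only. The helper names `translate` and `find_orfs` are illustrative, and an ORF is defined here simply as a methionine followed by residues up to the next stop.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, frame=1):
    """Translate complete codons of `seq`, starting at reading frame 1, 2 or 3."""
    seq = seq.lower()
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def find_orfs(protein):
    """Return stretches that start at M and run to the first stop (*)."""
    return re.findall(r"M[^*]*\*", protein)

seq = "atgcccaagctgaatagcgtagaggggttttcatcatga"
for frame in (1, 2, 3):
    protein = translate(seq, frame)
    print(frame, protein, find_orfs(protein))
```

Only frame 1 yields a start-to-stop ORF for this sequence. A fuller search would also translate the reverse complement and apply a minimum length.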
Gene identification - short ORFs can be
translated!
• e.g. the Drosophila tarsal-less gene
Galindo et al.
PLoS Biol 5(5): e106
doi:10.1371/journal.pbio.0050106
Gene identification – database searching
e.g. http://blast.ncbi.nlm.nih.gov/Blast.cgi
Gene identification – shared
synteny
Preserved localization of genes on
chromosomes of different species
e.g. mouse chromosome 11 and parts of
5 different human chromosomes
Perfect correspondence in order,
orientation and spacing of 23 putative
genes, and 245 conserved sequence
blocks in noncoding regions
Caution! Even regions of high synteny
may not show perfect gene-for-gene
correspondence
from Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.
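Conserved gene order can be detected by comparing ordered gene lists from two species. The sketch below is a deliberately strict illustration: it assumes orthologous genes share a name, considers only same-orientation runs, and breaks a block at any gene without a counterpart. The gene names and the function `collinear_blocks` are hypothetical.

```python
def collinear_blocks(order_a, order_b, min_len=2):
    """Find runs of genes that are adjacent and in the same order in both lists."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            gene in pos_b
            and current
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
            continue
        # The run ends here: keep it if it is long enough, then start again.
        if len(current) >= min_len:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks

# Hypothetical gene orders along a chromosome segment in two species.
species_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
species_b = ["g1", "g2", "g3", "x1", "g5", "g6"]
print(collinear_blocks(species_a, species_b))
```

The gap at `g4` splits the region into two blocks, which mirrors the caution that syntenic regions need not correspond gene for gene.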
Gene identification – shared synteny
Preserved localization of genes on chromosomes of different species
e.g. maize compared with sorghum (G) and rice (H)
Schnable et al.
Science 326:1112
Gene identification – promoter elements
• TATA – box elements
5'-TATAAA-3' or variant
plant and animal promoters
• CpG islands
Regions of higher than expected CpG dinucleotide
content, unmethylated in active promoters
~ 40% of mammalian promoters
~ 70% of human promoters
but NOT in plant promoter regions
• Y patch (pyrimidine-rich patch)
plant not mammalian promoters
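"Higher than expected" CpG content is usually expressed as an observed/expected ratio, where the expected count follows from the C and G frequencies of the same window. The sketch below computes that ratio and the GC fraction for a sequence window. The function names and test sequences are illustrative, and no particular island threshold is implied.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * n / (c * g)

# Illustrative windows: CpG-rich, CpG-depleted, and AT-only.
for window in ("CGCGCGCG", "CCCCGGGG", "ATATATAT"):
    print(window, gc_fraction(window), cpg_obs_exp(window))
```

In practice the ratio is evaluated in sliding windows along a genomic sequence, and windows that stay above a chosen cutoff are merged into candidate islands.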
Gene identification – introns & exons
• In long genes, the gene space is often more intron than exon
• Extreme example - human clotting factor VIII gene
Gene identification – alternative splicing
variants
Pistoni et al. RNA Biol 7:441
Gene identification – trans-splicing
Gingeras, Nature 461: 206
Gene identification – non-coding RNAs
• non-coding structural RNAs
rRNA & tRNA – translation
snoRNA – small nucleolar RNAs
guide chemical modification of rRNAs & tRNAs
snRNA – small nuclear RNAs
guide splicing reactions
• non-coding regulatory RNAs
miRNA (microRNAs) & siRNA (small interfering RNAs)
RNAi pathway
lncRNA - long noncoding RNAs
Origins of long non-coding RNAs
Overlapping transcriptional architecture
• e.g. the human phosphatidylserine decarboxylase (PISD)
gene
Kapranov, Nature Rev Genet 8:413
Functions of lncRNAs
Wilusz et al. Genes Dev. 23: 1494–1504
Genome - Transcriptome - Proteome
• Genome
Full complement of an organism’s hereditary information
• Transcriptome
Full set of RNA molecules, coding and non-coding,
transcribed from the genome
• Proteome
Full set of proteins expressed from a genome
• Not a 1:1:1 correspondence
Implications of gene and genetic
complexity
• What is the take-home message for forward
genetics?
Implications of gene and genetic
complexity
• Reverse genetics: Have gene – want phenotype
Predict phenotypes based on gene function in other
organisms
Knock out or knock down your gene of interest & look for
corresponding changes in phenotype
Gene families
• Gene duplication followed by:
Duplication of gene function
Divergence of gene function
Loss of gene function leading to a pseudogene
• e.g. human globin gene family
• e.g. human beta-globin gene cluster
chromosome 11
Five functional genes and two pseudogenes
Gene families – paralogs & orthologs
• Homologs
Protein or DNA sequences having shared ancestry
• Orthologs
Homologs created by a speciation event
May or may not retain the same function!
• Paralogs
Homologs created by a gene duplication event
May or may not retain the same function!
• It is not always easy or possible to distinguish orthologs
from paralogs when comparing genes or proteins
between species
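The definitions above amount to a rule on a gene tree: two genes are orthologs if their last common ancestor is a speciation node, and paralogs if it is a duplication node. The sketch below applies that rule to a simplified, hypothetical globin tree. The tuple encoding of the tree and the function names are assumptions made for illustration.

```python
# Hypothetical gene tree: internal nodes are (event, left, right); leaves are gene names.
tree = (
    "duplication",
    ("speciation", "human_alpha", "mouse_alpha"),
    ("speciation", "human_beta", "mouse_beta"),
)

def leaves(node):
    """List the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, out=None):
    """Label each gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have this node as last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out

for pair, label in classify_pairs(tree).items():
    print(sorted(pair), label)
```

Note that `human_alpha` and `mouse_beta` come out as paralogs even though they are in different species, because the duplication predates the speciation. The hard part in practice is inferring the event labels, which this sketch takes as given.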
Gene families – paralogs & orthologs
[Figure: globin gene paralogs]
Gene families – paralogs & orthologs
[Figure: relationships labeled orthologs and paralogs]
Storz et al. IUBMB Life 63:313
Implications of gene and genetic
complexity
• What are the implications of gene families for
forward genetics (i.e. looking for candidate genes
that condition a mutant phenotype?)
• What are the implications of gene families for
reverse genetics (i.e. altering gene function and
looking for a phenotype)?
Genome organization – repeated
sequences ~ 50% of the genome
• Segmental duplications and copy number
variation
• Tandemly repeated genes
rRNA, tRNA and histone gene products needed in large
amounts
• Duplicated gene families
• Transposons
• Tandem simple sequence repeats
centromeric & telomeric repeats
minisatellites
microsatellites
Repeated sequences – segmental
duplications & copy number variants
• Segmental duplications
> 1 kb block of duplicated sequence with > 90%
sequence identity
recombine to mediate further copy number variants
Koszul & Fischer, C.R. Biologies 332:254
Repeated sequences – segmental
duplications & copy number variants
• Copy number variant (CNV)
Deviation from diploid copy number at a locus
• Copy number polymorphism (CNP)
CNV present in >1% of a population
• Recent association with human developmental syndromes
Girirajan et al. Annu Rev Genet 45:203
Transposon-derived repeated sequences
• ~ 45% of human & 85% of maize genome
Transposon-derived repeated sequences
• Many are truncated & inactive
• Considered to be important in the evolution of genome
organization & function
Gogvadze & Buzdin, Cell Mol Life Sci 66:3727
Repeated sequences – short tandem repeats
• Centromeric
Long array (~100,000 bp) of short tandem repeats
repeat unit ~5 bp Drosophila, ~150 bp maize, ~170 bp human
not conserved across species
in some cases not even conserved in all chromosomes
of the same species
Association with a centromere-specific histone H3
• Telomeric
Length varies between species
~ 300 bp – 150 kb
Conserved, G-rich repeat sequence
vertebrates TTAGGG ; most plants TTTAGGG
Repeated sequences – short tandem
repeats
• Minisatellites (Variable number tandem repeats,
VNTRs)
10-100 bp repeat units
500-30,000 bp arrays
The original DNA fingerprinting marker via Southern
blotting
Now supplanted by microsatellites
Repeated sequences – short tandem repeats
• Microsatellites (Simple sequence repeats, SSRs)
Di, tri or tetra-nucleotide repeats; 1-10 repeat units per
locus
Repeat numbers expand or contract over a short
evolutionary, or even generational time-frame
Amplified by PCR
Primers based on unique flanking sequence
Products fractionated by capillary or acrylamide gel electrophoresis
Co-dominant mapping & fingerprinting markers
Both alleles can be detected in a heterozygous individual
variety A   [CACACACA]      variety B   [CACA]
            [GTGTGTGT]                  [GTGT]
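The co-dominant behavior of a microsatellite marker follows from simple arithmetic on PCR product length. The sketch below uses the CA repeat alleles of variety A and variety B shown above. The 20 bp flanking lengths (primer plus unique sequence on each side) are hypothetical, as are the function names.

```python
import re

def repeat_count(allele, unit):
    """Number of copies in the longest uninterrupted run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", allele)
    return max((len(run) // len(unit) for run in runs), default=0)

def amplicon_size(repeats, unit, flank_left=20, flank_right=20):
    """PCR product length; flank lengths are hypothetical."""
    return flank_left + repeats * len(unit) + flank_right

def expected_bands(alleles, unit):
    """Distinct product sizes expected for a genotype given as a list of alleles."""
    return sorted({amplicon_size(repeat_count(a, unit), unit) for a in alleles})

variety_a = "CACACACA"  # 4 CA repeats
variety_b = "CACA"      # 2 CA repeats

print(expected_bands([variety_a, variety_a], "CA"))  # homozygous A
print(expected_bands([variety_b, variety_b], "CA"))  # homozygous B
print(expected_bands([variety_a, variety_b], "CA"))  # heterozygote: two bands
```

The heterozygote yields both product sizes, so both alleles are scored directly from one lane or capillary trace.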