Genomics

MEDG520
Block 6
Genomics
Concepts:
 Genomics vs. Genetics (vs. Proteomics)
 Sex Determination in Humans
 Assigned reading: Skaletsky et al. 2003.
 Chromosome Sequencing
 Radiation hybrid mapping.
 Sequence Mapping
o Introduction to Sequence Mapping (Griffiths' An Introduction to
Genetic Analysis)
o Chromosome-specific libraries.
o Ordering by FISH.
o Ordering by clone fingerprints.
o Ordering by sequence-tagged sites.
o An example: cloning and mapping the human Y chromosome.
 Y Haplotype Markers
 Population Genetics
o Founder Effect
o Genetic Bottleneck
o Factors affecting variation
o Genetic Drift
 Gene Dispersion
 Genetic – Geographical Association
 Gene Conversion
o Mechanisms of Gene Conversion
 Homologous Recombination
 Gene Duplication
o Mechanisms of Gene Duplication
 Comparative Genomics
 Assigned readings: Skaletsky et al. 2003.
 Evolution of Autosomal Chromosomes
 Assigned Reading: Eichler and Sankoff. 2003
 Explain Synteny: Fragile versus random breakage models.
 What is unusual about Centromeric and Telomeric regions?
 Why are duplications important?
 What effect do Transposable elements have on the Chromosomal landscape?
 Chromosomal rearrangements and repeats: Cause or consequence?
 Evolution of X & Y Chromosomes
o Mechanisms of Evolution:
 DNA Microarray Technology
o Analysis of DNA Microarray Data, Statistical Significance
o Normalization and Noise:
o Supervised versus Unsupervised:
o Distance metrics versus clustering methods
o Post analysis Challenges
o Validation and Limitations of DNA Microarrays




Statistical Significance versus Biological Significance
Gene Networks
Identification of Pathways
Comparative Genomics to Functional Phylogenomics
Genomics vs. Genetics (vs. Proteomics)
Genomics
 Whole genome approach versus a few genes
 Operationally defined as investigations into the structure and function of very
large numbers of genes undertaken in a simultaneous fashion. Genetics looks at
single genes, one at a time, as a snapshot. Genomics is trying to look at all the
genes as a dynamic system, over time, and determine how they interact and
influence biological pathways and physiology, in a much more global sense
 Examples of genomics:
o Systems biology (rather than single genes in a pathway)
o Gene modifiers
o Networks biology (how pathways interact with each other)
Proteomics
Proteome:
 The complement of proteins expressed by an organism, tissue or cell type




The study of the full set of proteins encoded by a genome. The characterization of
patterns of gene expression at the protein level or the link between proteins and
genomes. Proteomics encompasses many different approaches to protein study,
from bioinformatics of protein content of genomes to large scale direct protein
analysis of complicated protein mixtures, and the definition of a protein's
properties, their interactions and modifications.
Look at how proteins interact
Protein-protein interactions
Look at protein expression (e.g. the Shiio et al. 2002 paper, which used microarray
technology to measure protein expression).
Sex Determination in Humans




The Y chromosome carries the testis-determining factor (TDF), later identified as SRY.
The gonadal ridge is bipotential: it can develop along either the male or the female
pathway.
TDF acts on the bipotential gonadal ridge to initiate the male pathway.
The testes produce testosterone, which ultimately leads to male development.
Diagram: bipotential gonadal ridge + SRY -> testis (produces testosterone) -> male;
bipotential gonadal ridge without SRY -> female.

In 1990 TDF was identified as SRY, based on XX patients in whom a piece of the Y
chromosome containing the SRY gene had been translocated.
 A transgenic mouse line carrying SRY was subsequently made, which provided the final
proof.
o Random insertion
o Overexpressed SRY on a female background (XXtg(sry))
o 3 of 11 were male; 2 of these then died
o The last one left was called Randy, the sex-reversed male
Take home message:
1. Female is default pathway
2. Need SRY, as it is the male determining factor.
Assigned reading:
Skaletsky et al. 2003. The male-specific region of the human Y chromosome is a
mosaic of discrete sequence classes

Suggest renaming of NRY (non-recombining region of Y chromosome) to MSY
(male specific region). A male-specific region flanked by pseudoautosomal
regions where X-Y recombination is frequent in male meiosis.

Why has it taken so long to construct an accurate, high-res physical map of the
MSY and how was it overcome?
o lengthy intrachromosomal repetitive sequences referred to as amplicons.
o Overcome by identifying minute variations between amplicon copies. Use
these sequence family variants as markers to be ordered for mapping.
o Required complete and accurate sequencing and comparison of amplicon
variants.
o This kind of physical mapping is called sequence mapping. See below.
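
As a rough illustration of how minute differences between near-identical amplicon copies can serve as markers, the sketch below scans copies that are already aligned and reports the columns where they differ. The function name and the toy sequences are hypothetical and are not taken from Skaletsky et al.; real amplicon copies are tens to hundreds of kb long and would first have to be aligned.

```python
def sequence_family_variants(copies):
    """Return (position, bases) for every column where aligned copies differ."""
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("copies must be aligned to the same length")
    variants = []
    for pos in range(length):
        column = tuple(copy[pos] for copy in copies)
        if len(set(column)) > 1:  # at least two copies disagree here
            variants.append((pos, column))
    return variants


# Made-up aligned amplicon copies; only position 3 distinguishes copy 3.
copies = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC"]
print(sequence_family_variants(copies))  # [(3, ('T', 'T', 'A'))]
```

Each reported column is a candidate sequence family variant: a BAC carrying the 'A' form must come from the third copy, which is what lets otherwise indistinguishable clones be placed on the map.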

What did the authors do?
o mapped and sequenced a tiling path of 220 BAC clones, each containing a
portion of the MSY from the same individual. Only one individual was
sequenced because they needed to make sure the small variations between
amplicons were not polymorphisms in the population.
o Obtained finished sequence (err=1 in 10000 bases) for all MSY
euchromatin except for two small problematic regions and much of the
heterochromatin.



What did they find?
o MSY includes 156 transcriptional units, half of which likely encode
protein.
o 60 of the 78 protein coding units fall into 9 MSY-specific gene families
(with sequence homology of 98% or greater within the families)
o The remaining 18 are single copy genes.
o Of the 27 genes or gene families, about half are expressed ubiquitously
and the other half only expressed in the testis.
o Three distinct classes of sequence make up the euchromatic MSY:
1. X-transposed – No X-Y crossing over. Lowest gene-density. Highest
LINE and SINE content. 99% identical to sequences in Xq21. Result of a
massive X-Y transposition.
2. X-degenerate – dotted with single-copy gene or pseudogene homologues
of 27 different X-linked genes (60-96% identity). Surviving relics of the
ancient autosomes from which X and Y evolved. About half of the genes
seem to be transcribed and functional, producing protein isoforms of their
counterparts on X. All the ubiquitous Y genes are located here.
3. ampliconic – very long repeats (tens or hundreds of kb) with very high
similarity (as much as 99.9%). Contain the most genes, including the 9
MSY specific gene families named above (with copy numbers ranging
from 2-35). Almost exclusively expressed in the testes. Have the lowest
LINE and SINE content.
Massive Palindromes and inverted repeats
o Found 8 huge palindromes in the ampliconic regions. 6 of these contain
recognizable coding genes with the gene existing in identical (or nearly
identical) form on each arm of the palindrome.
o Some of the gene families exist solely on palindromes
o Also characterize 5 sets of more widely spaced inverted repeats.
Tandem arrays
o eg. NORF (no long open reading frame) array consists of an array of ~2.48kb
repeats containing a great variety of spliced but apparently non-coding
transcriptional units.
Definitions:
Pseudogenes - Genes bearing close resemblance to known genes at different loci, but
rendered non-functional by additions or deletions in structure that prevent normal
transcription or translation. When lacking introns and containing a poly-A segment near
the downstream end (as a result of reverse copying from processed nuclear RNA into
double-stranded DNA), they are called processed genes.
Palindrome
- A region of DNA which reads the same in the forward and backward directions. An
example is 5'-CATG-3', which when both strands are considered is:
5'-CATG-3'
3'-GTAC-5'
If the top strand is read in the forward 5' to 3' direction, it reads 'CATG'. If
the bottom strand is read in the 5' to 3' direction, it also reads 'CATG', so the sequence
is a palindrome.
- A nucleotide sequence on a DNA molecule in which the same sequence is found on
each strand, but in the opposite direction.
- A self-complementary nucleic acid sequence, that is a sequence identical to its
complementary strand; perfect palindromes (e.g. GAATTC) frequently occur at sites of
recognition for restriction enzymes; less perfect palindromes (e.g.
TACCTCTGGCGTGATA) frequently occur in binding sites for other proteins, such as
repressors. (S)
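
The definitions above amount to saying that a DNA palindrome is equal to its own reverse complement. A minimal sketch of that check follows; the function names are illustrative only.

```python
# Watson-Crick complement for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_dna_palindrome(seq):
    """True if the sequence reads the same 5' to 3' on both strands."""
    return seq.upper() == reverse_complement(seq)


print(is_dna_palindrome("CATG"))    # True
print(is_dna_palindrome("GAATTC"))  # True (the example restriction site above)
print(is_dna_palindrome("GATTC"))   # False
```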
Chromosome Sequencing

Physical maps can be divided into three general types:
1. chromosomal or cytogenetic maps
2. radiation hybrid (RH) maps,
3. sequence maps.

The different types of maps vary in their degree of resolution, that is, the
ability to measure the separation of elements that are close together. The
higher the resolution, the better the picture.

One goal of physical mapping is to identify a set of overlapping cloned
fragments that together encompass an entire chromosome or an entire genome.
The resulting physical map is useful in three ways. First, the genetic markers
carried on the clones can be ordered and hence contribute to the overall
genome mapping process. Second, when the contiguous clones have been
obtained, they represent an ordered library of DNA sequences that can be
exploited for future genetic analysis. Third, these clones form the raw material
that will be sequenced in large-scale genome projects.
Radiation hybrid mapping.
This technique was designed to produce a higher-resolution map of molecular markers
along a chromosome. The procedure is to X-ray treat human cells to fragment the
chromosomes and then fuse the irradiated cells with rodent cells to form a panel of
different hybrids. In this case, the hybrids have an assortment of fragments of human
chromosomes, as diagrammed in Figure 14-11. Most of the fragments are seen to be
embedded in the rodent chromosomes, but truncated human chromosomes also can be
found.
A standard panel in the range of 100 to 200 radiation hybrids is quite straightforward to
obtain. Such a panel is sufficient to obtain a high-resolution cR3000 map of the human
genome, which would have 10-fold greater resolution than the current centimorgan
genetic map. One downside of the technique is that it is limited to those markers for
which human-rodent differences are available.
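
The notes do not spell out how marker order is inferred from the hybrid panel. The usual reasoning is that two markers lying close together are rarely separated by a radiation-induced break, so they tend to be retained or lost together across hybrids. The sketch below computes a crude discordance score for that idea. The function name and the retention data are invented for illustration, and this is not the statistical estimator used to build real cR maps.

```python
def discordance(marker_a, marker_b):
    """Fraction of informative hybrids that retain exactly one of two markers.

    Each argument is a list of 0/1 retention scores over the same hybrid panel.
    Hybrids retaining neither marker are ignored as uninformative.
    """
    if len(marker_a) != len(marker_b):
        raise ValueError("markers must be scored on the same hybrid panel")
    informative = [(a, b) for a, b in zip(marker_a, marker_b) if a or b]
    if not informative:
        return 0.0
    return sum(a != b for a, b in informative) / len(informative)


# Hypothetical retention patterns in a panel of eight hybrids.
m1 = [1, 1, 0, 0, 1, 0, 1, 0]
m2 = [1, 1, 0, 0, 1, 0, 0, 0]
m3 = [0, 1, 1, 0, 0, 1, 1, 0]

print(discordance(m1, m2))  # 0.25, so m1 and m2 are probably close together
print(discordance(m1, m3) > discordance(m1, m2))  # True, m3 is probably farther away
```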
Figure 14-15. Using sequence-tagged sites (STSs) to order overlapping clones (YACs, in
this example) into a contig. Five different YACs are tested to determine which STSs they
contain (top), and these data are used to assemble a physical map (bottom).
Sequence Mapping
Sequence tagged site, or STS mapping, is another physical mapping
technique. An STS is a short DNA sequence that has been shown to be
unique. To qualify as an STS, the exact location and order of the bases of
the sequence must be known, and this sequence may occur only once in
the chromosome being studied or in the genome as a whole if the DNA
fragment set covers the entire genome.
Common Sources of STSs

Expressed sequence tags (ESTs) are short sequences obtained
by analysis of complementary DNA (cDNA) clones.
Complementary DNA is prepared by converting mRNA into
double-stranded DNA and is thought to represent the sequences
of the genes being expressed. An EST can be used as a STS if
it comes from a unique gene and not from a member of a gene
family in which all of the genes have the same, or similar,
sequences.

Simple sequence length polymorphisms (SSLPs) are arrays
of repeat sequences that display length variations. SSLPs that
are polymorphic and have already been mapped by linkage
analysis are particularly valuable because they provide a
connection between genetic and physical maps.

Random genomic sequences are obtained by sequencing
random pieces of cloned genomic DNA or by examining
sequences already deposited in a database.
To map a set of STSs, a collection of overlapping DNA fragments from
a chromosome is digested into smaller fragments using restriction
enzymes, agents that cut up DNA molecules at defined target points.
The data from which the map will be derived are then obtained by
noting which fragments contain which STSs. To accomplish this,
scientists copy the DNA fragments using a process known as
"molecular cloning". Cloning involves the use of a special technology,
called recombinant DNA technology, to copy DNA fragments inside
a foreign host. First, the fragments are united with a carrier, also called
a vector. After introduction into a suitable host, the DNA fragments can
then be reproduced along with the host cell DNA, providing unlimited
material for experimental study. An unordered set of cloned DNA
fragments is called a library.
Next, the clones, or copies, are assembled in the order they would be
found in the original chromosome by determining which clones contain
overlapping DNA fragments. This assembly of overlapping clones is
called a clone contig. Once the order of the clones in a chromosome is
known, the clones are placed in frozen storage, and the information
about the order of the clones is stored in a computer, providing a
valuable resource that may be used for further studies. These data are
then used as the base material for generating a lengthy, continuous
DNA sequence, and the STSs serve to anchor the sequence onto a
physical map.
Introduction to Sequence Mapping (Griffiths' An Introduction to Genetic Analysis)
In the preparation of physical maps of genomes, vectors that can carry very large inserts
are naturally the most useful. Cosmids, YACs (yeast artificial chromosomes), BACs
(bacterial artificial chromosomes), and PACs (phage P1-based artificial chromosomes)
have been the main types. As cloning vectors, BACs can carry inserts of foreign DNA as
large as 300 kb, although the average is about 100 kb. PACs are produced by similar
engineering of phage P1; they carry inserts comparable to those of BACs.
Although the maximum insert sizes of BACs and PACs are not as large as those of
YACs, the former types have several advantages over YACs. First, they can be amplified
in bacteria and isolated and manipulated simply with basic bacterial plasmid technology.
Second, BACs and PACs form fewer hybrid inserts than YACs do. Hybrid inserts are
composed of several different fragments; their presence can thwart attempts to order the
clones.
Cloning a whole genome begins by amassing a large number of randomly cloned inserts.
The contents of these clones must be characterized in some way, and overlaps must be
determined. A set of overlapping clones is called a contig. In the early phases of a
genome project, contigs are numerous and represent cloned "islands" of the genome. But,
as more and more clones are characterized, contigs enlarge and merge into one another,
and eventually the project should end up with a set of contigs that equals the number of
chromosomes.
Chromosome-specific libraries.
If a library of clones is prepared from total genomic DNA, then contig development is
relatively slow. However, if a specific chromosome can be used to develop the library of
clones, contigs emerge more rapidly. PFGE can be used to isolate individual
chromosomes (if they are small) or chromosome fragments cut with "long-cutter"
enzymes such as NotI. Flow sorting is another option for preparing DNA of a specific
chromosome. Chromosomes (such as human chromosomes) can be flow-sorted by
fluorescence-activated chromosome sorting (FACS; Figure 14-13). In this procedure,
metaphase chromosomes are stained with two dyes, one of which binds to AT-rich
regions and the other to GC-rich regions. Cells are disrupted to liberate whole
chromosomes into liquid suspension. This suspension is converted into a spray in which
the concentration of chromosomes is such that each spray droplet contains one
chromosome. The spray passes through laser beams tuned to excite the fluorescence.
Each chromosome produces its own characteristic fluorescence signal, which is
recognized electronically, and two deflector plates direct the droplets containing the
specific chromosome needed into a collection tube.
MESSAGE
Genomic cloning proceeds by assembling clones into overlapping groups called
contigs. As more data accumulate, the contigs become equivalent to whole
chromosomes.
Several different techniques are used to order genomic clones into contigs. We shall
consider some of the main ones.
Ordering by FISH.
If good chromosomal landmarks are known, FISH analysis can be used to locate the
approximate positions of the large inserts. Figure 14-14 shows results of a FISH analysis
that generates a rough ordering of BACs and PACs in human chromosomes.
Ordering by clone fingerprints.
The genomic insert carried by a vector has its own unique sequence, which can be used to
generate a DNA fingerprint. For example, a multiple restriction-enzyme digestion can
generate a set of bands whose number and positions are a unique "fingerprint" of that
clone. The different bands generated by separate clones can be aligned either visually or
by using a computer program to determine if there is any overlap between the inserted
DNAs. In this way, the contig can be built up.
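
A minimal sketch of the fingerprint comparison, assuming each clone's digest has been reduced to a list of band sizes in base pairs and that two bands count as the same if they agree within a gel-resolution tolerance. The function name, the tolerance, and the band sizes are all hypothetical.

```python
def shared_bands(bands_a, bands_b, tolerance_bp=50):
    """Count bands of clone A that match a distinct band of clone B within tolerance."""
    remaining = sorted(bands_b)
    matches = 0
    for size in sorted(bands_a):
        for other in remaining:
            if abs(size - other) <= tolerance_bp:
                matches += 1
                remaining.remove(other)  # each band of B can be matched only once
                break
    return matches


# Invented restriction-fragment sizes (bp) for three clones.
clone_a = [1200, 2500, 4100, 7300]
clone_b = [2520, 4080, 7310, 9800]
clone_c = [800, 1500, 3300]

print(shared_bands(clone_a, clone_b))  # 3, so A and B probably overlap
print(shared_bands(clone_a, clone_c))  # 0, no evidence of overlap
```

In practice a threshold on the number of shared bands would be needed, because unrelated clones can share a band by chance.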
Ordering by sequence-tagged sites.
Unique short sequences of large cloned inserts can be used as tags to align the various
clones into contigs. For example, if clone A has tags 1 and 2 and clone B has tags 2 and
3, clones A and B must overlap in the region of tag 2. The practical procedure is to amass
a large set of random clones with small genomic inserts (say, in λ phage) and sequence
short regions of each. From these sequences, pairs of PCR primers are designed that will
amplify the short specific sequence of DNA flanked by the primers. These short DNA
sequences are known as sequence-tagged sites (STSs). Even though initially the location
of these STSs in the genome is not known, a panel of many STSs can be used to
characterize clones with large genomic inserts (such as YAC clones). The clones that are
shown to have specific STSs in common must have overlapping inserts and therefore can
be aligned into contigs. An example of this process is shown in Figure 14-15.
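
The clone A / clone B reasoning above can be sketched as a set intersection over the STS content of each clone. The data structure and function name are illustrative assumptions; a third clone C is added to show how a contig grows.

```python
from itertools import combinations


def clone_overlaps(sts_content):
    """Map each pair of clones to the STSs they share (pairs sharing none are omitted)."""
    overlaps = {}
    for a, b in combinations(sorted(sts_content), 2):
        shared = sts_content[a] & sts_content[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Clone A has tags 1 and 2, clone B has tags 2 and 3, clone C has tags 3 and 4.
content = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
print(clone_overlaps(content))  # {('A', 'B'): {2}, ('B', 'C'): {3}}
```

Here A overlaps B and B overlaps C, but A and C share no STS, which implies the order A, B, C along the contig.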
Short stretches of sequence are sometimes obtained from cDNA clones. These stretches
are known as expressed sequence tags (ESTs). ESTs are obtained by sequencing into
the cDNA insert by using a primer based on the vector sequence. They can be used to
align the cDNAs on the contig, thus anchoring the gene map to the physical map. Further,
if part of the open reading frame (ORF) of the transcript is contained within the EST, the
"virtual" translation of the ORF can provide a "sneak preview" of the function of the
protein encoded by the mRNA from which the cDNA was derived.
Furthermore, the DNA of the contigs has been arranged on nitrocellulose filters in
ordered arrays; so, to find out where a specific piece of DNA of interest lies in the
genome, that DNA is used as a probe on the contig filters, and a positive hybridization
signal announces the precise location of the DNA (Figure 14-16).
An example: cloning and mapping the human Y chromosome.
Several of the smaller human chromosomes have been fully cloned as overlapping
sets of YAC clones (contigs). We shall examine the cloning of the Y chromosome as
an example because it illustrates several of the techniques of physical mapping. The
STS map of the Y chromosome was in fact obtained by two different methods: YAC
alignment and deletion analysis.
YAC alignment.
Flow sorting yielded a sample of Y chromosomes, from which λ
clones were made. From clones that did not contain repetitive DNA, STS primers
were designed. In all, 160 primer pairs were made. A Y chromosome YAC library of
10,368 clones was obtained in which the average insert size was 650 kb. From these
numbers, each point on the Y chromosome was estimated to have been sampled an
average of four times. The YAC clones were divided into 18 pools of 576 YACs, and
the pools were screened with the STS primers. Subdivision of positive pools led
rapidly to the assignment of a particular STS to specific YACs. The total STS content
of each YAC was assessed, and overlaps between the YACs were determined in the
same way as that shown in the generalized example in Figure 14-15.
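
The pooling arithmetic (18 pools x 576 YACs = 10,368 clones) and the subdivision of positive pools can be sketched as follows. The source does not say how positive pools were subdivided, so repeated halving is an assumption here, as are the function name and the positions of the four STS-positive clones.

```python
def find_positive_clones(clones, has_sts, pool_size):
    """Pooled screening: test pools, then halve positive pools down to single clones.

    Returns the positive clones and the number of PCR tests performed.
    """
    pools = [clones[i:i + pool_size] for i in range(0, len(clones), pool_size)]
    positives, tests = [], 0
    while pools:
        pool = pools.pop()
        tests += 1  # one PCR per pool tested
        if not any(has_sts(clone) for clone in pool):
            continue  # negative pool: every clone in it is excluded at once
        if len(pool) == 1:
            positives.append(pool[0])
        else:
            half = len(pool) // 2
            pools.extend([pool[:half], pool[half:]])
    return sorted(positives), tests


library = list(range(18 * 576))        # 10,368 clone identifiers
carriers = {17, 4000, 4001, 9999}      # hypothetical: the STS was sampled four times
found, n_tests = find_positive_clones(library, carriers.__contains__, 576)
print(found)                           # [17, 4000, 4001, 9999]
print(n_tests < len(library))          # True: far fewer PCRs than testing every YAC
```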
Deletion analysis.
Various types of Y chromosome deletions occur naturally. For
example, some XX males contain truncated fragments of the Y, whereas some XY
females have deletions of the region containing the maleness (testis-determining)
gene (see Chapters 2 and 23). These Y deletions were maintained in cell culture and
formed the basis for aligning the Y chromosome STSs. Each deletion was tested for
STS content. Because by nature the deletions were nested sets, the STS content could
be used not only to develop an STS map, but also to map the coverage of the
deletions. The principle is illustrated in Figure 14-17. The STS maps produced by
YAC alignment and by deletion analysis were identical.
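
A minimal sketch of the nested-deletion logic, under the simplifying assumption that every deletion removes material from the same end of the chromosome, so the retained STS sets are strictly nested. The STS names, deletion line names, and function name are hypothetical.

```python
def order_sts_by_nested_deletions(retained):
    """Order STSs from the most often retained to the least often retained.

    `retained` maps each deletion line to the set of STSs it still carries.
    With strictly nested deletions, this ranking is the order along the chromosome.
    """
    counts = {}
    for sts_set in retained.values():
        for sts in sts_set:
            counts[sts] = counts.get(sts, 0) + 1
    return sorted(counts, key=lambda sts: (-counts[sts], sts))


retained = {
    "del1": {"sY1"},
    "del2": {"sY1", "sY2"},
    "del3": {"sY1", "sY2", "sY3"},
}
print(order_sts_by_nested_deletions(retained))  # ['sY1', 'sY2', 'sY3']
```

The same table also maps the deletions themselves: del1 breaks between sY1 and sY2, del2 between sY2 and sY3.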
MESSAGE
Clones can be arranged into contigs by matching DNA fingerprints, by matching short
sequences within cloned segments, and by analyzing deletions.
(Source: Griffiths' An Introduction to Genetic Analysis)

Example: Skaletsky et al. (2003) sequenced the MSY region of the Y
chromosome. Because previous efforts to sequence its repetitive regions had failed, they
identified minute variations between amplicon copies and used these variations
as markers to sequence map the tiling path of BAC clones for all euchromatin of
the MSY.
Y Haplotype Markers

Haplotype refers to a set of closely linked genetic markers present on one
chromosome which tend to be inherited together
 The Y haplotype is the easiest haplotype to obtain because the Y chromosome is
present as a single copy; you don't need the parents.
 Y haplotypes are usually larger than other haplotypes because recombination is
reduced on the Y chromosome and there is reduced homology with the X chromosome.
 Example: Zerjal et al (2003) The Genetic Legacy of the Mongols:
o used Y haplotypes markers to identify a Y chromosomal lineage found in 16
populations across Asia with a frequency of ~ 8%.
o Pattern of lineage suggested it originated in Mongolia ~ 1000 years ago.
o Pattern spread via selection – social selection
o The lineage is carried by male-line descendants of Genghis Khan.
Population Genetics

Founder Effect
o One form of genetic drift occurs when a small group breaks off from a
larger population to found a new colony.
o This "acute drift," called the founder effect, results from a single
generation of sampling, followed by several generations during which the
population remains small.
o One result of random sampling is that most new mutations, even if they
are not selected against, never succeed in entering the population.
o The founder effect is probably responsible for the virtually complete lack
of blood group B in Native Americans, whose ancestors arrived in very
small numbers across the Bering Strait at the end of the last Ice Age, about
20,000 years ago. (Griffiths Modern Genetic Analysis)
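
To see why a small founding group can lose an allele entirely, note that N diploid founders carry 2N gene copies, so an allele at frequency p in the source population is missing from all of them with probability (1 - p)^(2N), assuming founders are a random sample. The sketch below evaluates this; the allele frequency and group sizes are made-up numbers, not estimates for blood group B.

```python
def prob_allele_absent(freq, n_founders):
    """Chance that none of the 2N gene copies carried by N founders is the allele."""
    return (1.0 - freq) ** (2 * n_founders)


# Hypothetical allele at 5% frequency in the source population.
for n in (10, 100, 1000):
    print(n, prob_allele_absent(0.05, n))
```

The probability falls off very quickly with founder number, which is why complete loss of an allele points to a very small founding group.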



Genetic Bottleneck
o A brief reduction in size of a population which usually leads to random
genetic drift.
Factors affecting variation
o Increasing Variation:
 Mutation
 Recombination
 Migration
o Decreasing Variation:
 Genetic Drift
 Natural Selection (Can also maintain allele frequency - Ex.
Heterozygote advantage: Sickle Cell Anemia – Htz protected from
Malaria)
Genetic Drift
o Random Effects on the gene pool
o E.g. two alleles, A1 and A2:
                     A1      A2
   Freq:             0.5     0.5
   Next generation:  0.49    0.51
   Next generation:  0.48    0.52
 Genetic drift has a larger effect in a smaller population
 Genetic drift takes longer to change allele frequencies in a larger population
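
The generation-to-generation wobble in the example above can be simulated with a simple random-sampling model (often called the Wright-Fisher model, which is an assumption here, not something named in the notes): each generation, 2N gene copies are drawn at random from the previous generation's allele frequencies. Function and parameter names are illustrative.

```python
import random


def simulate_drift(freq, pop_size, generations, seed=None):
    """Track the frequency of allele A1 under random sampling of 2N gene copies."""
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    trajectory = [freq]
    for _ in range(generations):
        # Each gene copy in the next generation is A1 with probability `freq`.
        count = sum(rng.random() < freq for _ in range(n_copies))
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory


small = simulate_drift(0.5, pop_size=10, generations=20, seed=1)
large = simulate_drift(0.5, pop_size=1000, generations=20, seed=1)
print(small)
print(large)
```

Running this typically shows the small population's frequency swinging widely, sometimes reaching 0 or 1, while the large population stays near 0.5 over the same number of generations.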
Gene Dispersion



Refers to the spreading of different genotypes (Aa Ba => ABAa)
Occurs through:
1. Migration
2. Recombination
Example: Genghis Khan (illustrates an interesting cultural form of gene
dispersion)
Genetic – Geographical Association


Refers to a situation where a population moves or becomes isolated.
In this case, there is a lack of random assortment and random mating within that
gene pool (= all of the alleles available among the reproductive members of a
population from which gametes can be drawn).
Isolation may come about in 2 ways:
1. Geographically
2. Culturally
An example of cultural isolation is the Ashkenazi Jewish population, which tends
to marry within the community and is thus isolated from other populations.
Gene Conversion

This is defined as the non-reciprocal transfer of genetic information between two
genes with a high degree of homology.

A meiotic process of directed change in which one allele directs the conversion of
a partner allele to its own form. In asci of Ascomycete fungi a 4:4 ratio of alleles
is expected after meiosis, yet 6:2 and 5:3 ratios are sometimes observed. A model
of recombination, produced by Holliday, suggests that gene conversion may be
explained by repair of heteroduplex DNA.

A type of nonreciprocal recombination event in which a recipient strand of DNA
receives information from another strand having an allelic difference. The
recipient strand has its original allele "converted" to the new allele as a
consequence of the event.

Gene conversion can sometimes involve transfer between repeated sequences on
the same chromosome.
Figure 9.10. Gene conversion involves a nonreciprocal sequence exchange between
allelic or nonallelic genes. (A) Interallelic gene conversion. Note the nonreciprocal
nature of the sequence exchange - the donor sequence is not altered but the acceptor
sequence is altered by incorporating sequence copied from the donor sequence. (B)
Interlocus gene conversion. This is facilitated by a high degree of sequence homology
between nonallelic sequences, as in the case of tandem repeats. (C) Mismatch repair of a
heteroduplex. This is one of several possible models to explain gene conversion. The
model envisages invasion by one strand of the donor sequence (-) to form a heteroduplex
with the complementary (+) strand of the acceptor sequence, thereby displacing the other
strand of the acceptor. Mismatch repair enzymes recognize the mispaired bases in the
heteroduplex and 'correct' the mismatches so that the (+) acceptor sequence is 'converted'
to be perfectly complementary in sequence to the (-) donor strand. Subsequent replication
of the (-) acceptor strand and sealing of nicks results in completion of the conversion.
(Source: Strachan and Read, Human Molecular Genetics 2)
Homologous sequences. Orthologs and Paralogs are two types of homologous
sequences. Orthology describes genes in different species that derive from a common
ancestor. Orthologous genes may or may not have the same function. Paralogy describes
homologous genes within a single species that diverged by gene duplication.
Mechanisms of Gene Conversion
(See Fig. 9.10 above for illustration)
There are several different theories or hypotheses about the mechanisms of gene
conversion.
Homology-directed double-strand break repair:

Occurs between homologous chromosomes or sister chromatids.

When the damaged DNA and the template DNA differ slightly, the resulting
mismatches are corrected by base excision repair or nucleotide excision
repair. Thus, double-strand breaks are repaired by homologous recombination,
in which the broken chromosome is patched up by copying information from a
homologous chromosome or sister chromatid.
The following figure illustrates the difference between gene conversion and DNA
crossover.
Comparison between gene conversion and DNA crossover. (a) Two DNA molecules.
(b) Gene conversion - the red DNA donates part of its genetic information (e-e' region)
to the blue DNA. (c) DNA crossover - the two DNAs exchange part of their genetic
information (f-f' and F-F').
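
The distinction in the caption can be mimicked with strings, treating each letter as a region of the molecule. This is only a toy model of the outcome, not of the mechanism, and the function names and coordinates are illustrative.

```python
def gene_conversion(donor, acceptor, start, end):
    """Non-reciprocal: the acceptor receives donor[start:end]; the donor is unchanged."""
    return donor, acceptor[:start] + donor[start:end] + acceptor[end:]


def crossover(a, b, point):
    """Reciprocal: the two molecules swap everything beyond the crossover point."""
    return a[:point] + b[point:], b[:point] + a[point:]


red, blue = "ABCDEFGH", "abcdefgh"
print(gene_conversion(red, blue, 4, 5))  # ('ABCDEFGH', 'abcdEfgh')
print(crossover(red, blue, 5))           # ('ABCDEfgh', 'abcdeFGH')
```

After conversion the red molecule is untouched and blue has gained region E; after crossover both molecules are recombinant.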
An origin of gene conversion. (a) Heteroduplexes formed by the resolution of Holliday
structure or by other mechanisms. (b) The blue DNA uses the invaded segment (e') as
template to "correct" the mismatch, resulting in gene conversion. (c) Both DNA
molecules use their original sequences as template to correct the mismatch. Gene
conversion does not occur.
The widely accepted model for genetic recombination was first proposed by Robin
Holliday in 1964. It involves several steps as illustrated in the following figure.
Homologous Recombination
Homologous recombination occurs between two homologous DNA molecules. It is also
called DNA crossover. During meiosis, two homologous pairs of sister chromatids align
side by side, and DNA crossover is very likely to occur, possibly as often as several
times per meiosis.
DNA crossover. (a) Two homologous pairs of sister chromatids align side by side. (b)
The two homologs are connected at a certain point called chiasma. (c) The two
homologs exchange the DNA segment from the chiasma to the end of chromosomes.
The Holliday model of DNA crossover (genetic recombination).
(a) Two homologous DNA molecules line up (e.g., two nonsister chromatids line up
during meiosis).
(b) Cuts in one strand of both DNAs.
(c) The cut strands cross and join homologous strands, forming the Holliday structure (or
Holliday junction).
(d) Heteroduplex region is formed by branch migration.
(e) Resolution of the Holliday structure. Figure 8-D-2e is a different view of the Holliday
junction than Figure 8-D-2d. DNA strands may be cut along either the vertical line or
horizontal line.
(f) The vertical cut will result in crossover between f-f' and F-F' regions. The
heteroduplex region will eventually be corrected by mismatch repair.
(g) The horizontal cut does not lead to crossover after mismatch repair. However, it
could cause gene conversion.
Gene Duplication

The presence of an extra segment of DNA, resulting in redundant copies of a
portion of a gene, an entire gene, or a series of genes, usually caused by unequal
crossing-over during gene replication when gametes are formed in meiosis.
Mechanisms of Gene Duplication

Due to significant sequence homology, the repetitive sequence region of a
chromatid may not line up exactly with its corresponding region in a homologous
chromatid or identical sister chromatid. As a result, different numbers of repeat
units may be generated during meiosis. This is thought to be the major
mechanism underlying VNTRs (variable number of tandem repeats).
Unequal crossover and sister chromatid exchange.
(a) Two pairs of sister chromatids line up during meiosis. A repetitive region of one
chromatid (the third one) does not line up exactly with its corresponding region in other
chromatids.
(b) Strand breaks on nonsister chromatids (along line A) will result in unequal crossover,
producing different numbers of repeat units in these chromatids.
(c) Strand breaks on sister chromatids (along line B) also produce different repeats. In
this case, it is called sister chromatid exchange. The detailed mechanism of DNA
crossover in (b) and (c) may be explained by the Holliday model.
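
A toy model of unequal crossover, assuming each chromatid's repetitive region is a list of repeat units and that misalignment means the break falls after a different repeat unit on each chromatid. The function name and repeat counts are illustrative.

```python
def unequal_crossover(array_a, array_b, break_a, break_b):
    """Exchange after unit `break_a` of A and unit `break_b` of B.

    break_a != break_b models arrays that are paired out of register.
    """
    product_1 = array_a[:break_a] + array_b[break_b:]
    product_2 = array_b[:break_b] + array_a[break_a:]
    return product_1, product_2


a = ["R"] * 5  # five repeat units on one chromatid
b = ["r"] * 5  # five repeat units on the other
longer, shorter = unequal_crossover(a, b, 4, 2)
print(len(longer), len(shorter))  # 7 3
```

The total number of repeat units is conserved (10), but one product gains units (a duplication) and the other loses them (a deletion). With equal breakpoints the repeat numbers stay at 5 and 5.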
Comparative Genomics




The study of comparing complete genome sequences, to understand general
principles of genome structure and function
The study of human genetics by comparisons with model organisms such as mice,
the fruit fly, and the bacterium E. coli.
A comprehensive view of large-scale changes in synteny, gene order, and regions
of nonconservation while simultaneously affording exquisite molecular resolution
at the level of the nucleotide.
Example: Comparative analysis of the Y chromosome using human and ape
genomes (see Skaletsky et al. below for details).
Assigned readings:
Skaletsky et al. 2003. Abundant gene conversion between arms of palindromes in
human and ape Y chromosomes.
What did they do?
 Looked at human MSY palindromes. These are very large palindromes with very
high intra-palindromic sequence identity (99.87%).
 This high degree of identity implicates a recent gene duplication (gene
amplification?)
 However, comparative analysis with other apes finds very similar genes and
indicates the palindromes might predate the human-chimpanzee divergence.
 They suggest gene conversion as an alternative explanation for the high degree of
similarity between the palindrome arms.
How did they test this hypothesis?
 Looked for MSY palindromes in the closest living relatives of humans,
i.e. chimpanzees, bonobos, and gorillas.
 Used PCR to amplify inner and outer boundaries of all 8 palindromes for all
species.
 Identified (by sequence homology) and sequenced chimp BACs corresponding to
4 of 8 palindromes.
 Studied a C/T SNP using 171 unrelated men that are known to represent 42
distinct branches of a robust tree of human Y chromosome genealogy. See figure
in paper.
What did they find?
 Most palindromes found in the human MSY were present in the common ancestor
of humans and chimpanzees.
 Inner boundaries are more conserved than outer boundaries.
 Chimp BAC palindrome sequencing revealed only 1.44% sequence divergence from
the human sequence. This probably just represents the accumulation of neutral
mutations in the human and chimp lineages after separation.
 Found the same extremely high sequence identity between the two arms of a
palindrome in chimps as in humans.
 Y chromosome genealogy data show predominantly C on both arms (C/C). A
C/T genotype arose recently but was subsequently and rapidly converted to T/T or
back to C/C by gene conversion.
Conclusions
 MSY palindromes predate separation of human and chimp lineages
 Paired arms of palindromes (for humans and chimps) evolved in concert.
 Gene conversion between the two arms removes polymorphisms as they arise.
 The high degree of similarity between arms indicates a steady-state balance between
new mutations and the gene-conversion events that erase them.
 They calculate that, on average, 600 duplicated nucleotides have undergone arm-to-arm
gene conversion for every son born in recent evolutionary history.
 The MSY (male-specific region of the Y) used to be known as the NRY
(non-recombining region of the Y) but, in fact, it recombines within itself at a very high rate.
 Conservation between humans and chimps in the palindromic regions is
extremely high even for non-functional sequences (eg. Alu repeats). Suggest
there might be a slight bias of gene conversion towards original sequence.
 Cast doubt on molecular clock methods that give recent dates to gene duplications
because of high similarity. If these duplications are also present in ancient
ancestors, they could be much older and similarities between duplications may be
the result of a gene conversion mechanism as proposed here.
 Suggest that this mechanism may have evolved to combat the evolutionary decay
observed elsewhere in the Y chromosome, and that it explains why most intact
testis-specific genes are located in the palindromic arms: a safe place to keep
important genes.
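The molecular-clock caveat above can be made concrete with a short calculation. This is a sketch under stated assumptions: it uses the 99.87% arm-to-arm identity and the 1.44% human-chimp divergence quoted in these notes, and it assumes (hypothetically, for illustration only) a human-chimp split about 6 million years ago and the same neutral substitution rate in both comparisons.

```python
def naive_duplication_age(arm_divergence, species_divergence, split_time_my):
    """Age a naive molecular clock would assign to a duplication.

    Assumes divergence accumulates linearly at the same rate between the two
    palindrome arms as between the two species, and that nothing (such as
    gene conversion) ever erases differences between the arms.
    """
    return split_time_my * arm_divergence / species_divergence


arm_divergence = 1 - 0.9987   # from the 99.87% intra-palindromic identity
species_divergence = 0.0144   # 1.44% human-chimp divergence in the palindromes
split_time_my = 6.0           # ASSUMED split time in millions of years

age = naive_duplication_age(arm_divergence, species_divergence, split_time_my)
print(f"Naive clock estimate: {age:.2f} million years")
```

Under these assumptions the naive estimate is roughly half a million years, far younger than the split itself, even though the palindromes are present in both lineages. Gene conversion between the arms resolves the contradiction: the arms look young because differences are continually erased.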
Evolution of Autosomal Chromosomes
 Whole Genome Duplication:
o Ex/Proof: a single Drosophila gene has 4 human paralogues
4N --(mutate)--> 2N
   --(mutate)--> 2N (both mutated so much that they become different from the original)
 Subgenomic Duplication
o Methods:
1. Non-reciprocal homologous recombination
 ↑ variation
2. Non-reciprocal sister chromatid exchange
 0 variation, because it is an exact switch
3. Transposition:
 One gene jumps out of its spot and moves somewhere else.
 Retrotransposons – an RNA copy is made from the DNA and is then
reverse-transcribed into new DNA that inserts elsewhere
 Duplication is an essential evolutionary mechanism, as one copy is
allowed to change because the other copy provides a back-up for
functional purposes.
[Figure: two homologous chromosomes. 1: Non-reciprocal homologous recombination;
2: Non-reciprocal sister chromatid exchange.]
Assigned Reading: Eichler and Sankoff. 2003. Structural dynamics of eukaryotic
chromosome evolution.
What points are they trying to make?
 Chromosomes evolve by modification, acquisition, deletion, and/or rearrangement
of genetic material.
Explain Synteny: Fragile versus random breakage models.
 Chromosome structure for 2 eukaryotic genomes with a common ancestor is
altered by intrachromosomal inversions or reciprocal interchromosomal
translocations.
 Conserved synteny is observed when a number of sequence markers map to a
single chromosome in each genome, irrespective of order. If the markers have the
same order, we call it a homologous segment.
 The random breakage model hypothesizes that breakpoints between homologous
segments are distributed uniformly at random along the genome. When chromosomes
are considered on a large, low-resolution scale, this is generally true.
 The fragile breakage model arose as genomes were completely sequenced and a
much higher resolution analysis was conducted. When micro-rearrangements are
considered (small inversions, deletions, or transpositions within otherwise
conserved segments), many more breakpoints are found, and their density varies
greatly between chromosomal regions.
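One way to see the difference between the two models is to count breakpoints in equal-sized windows and compare the variance of the counts with their mean. The sketch below is a toy simulation, not the method of Eichler and Sankoff: the hotspot parameters (`hot_fraction`, `hot_share`), the genome length, and the window count are all hypothetical values chosen for illustration.

```python
import random
from statistics import mean, variance


def random_breakpoints(n, genome_length, rng):
    """Random breakage model: every position is equally likely to break."""
    return [rng.uniform(0, genome_length) for _ in range(n)]


def fragile_breakpoints(n, genome_length, rng, hot_fraction=0.1, hot_share=0.8):
    """Toy fragile model: `hot_share` of all breaks fall in a fragile region
    covering the first `hot_fraction` of the genome (assumed values)."""
    hot_end = hot_fraction * genome_length
    positions = []
    for _ in range(n):
        if rng.random() < hot_share:
            positions.append(rng.uniform(0, hot_end))
        else:
            positions.append(rng.uniform(hot_end, genome_length))
    return positions


def window_counts(positions, genome_length, n_windows):
    """Count breakpoints falling in each of n_windows equal-sized windows."""
    counts = [0] * n_windows
    for p in positions:
        idx = min(int(p / genome_length * n_windows), n_windows - 1)
        counts[idx] += 1
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio: near 1 for uniform breakage, larger when
    breakpoints cluster in particular regions."""
    return variance(counts) / mean(counts)


rng = random.Random(1)
length, n_breaks, n_win = 1_000_000, 200, 20
uniform = window_counts(random_breakpoints(n_breaks, length, rng), length, n_win)
fragile = window_counts(fragile_breakpoints(n_breaks, length, rng), length, n_win)
print("random breakage :", round(dispersion_index(uniform), 2))
print("fragile breakage:", round(dispersion_index(fragile), 2))
```

A coarse window size hides small clusters, which mirrors the point above that the random model looks adequate at low resolution and breaks down once micro-rearrangements are counted.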
What is unusual about Centromeric and Telomeric regions?
 Unusually dynamic
 Sequencing and annotation remains spotty because of technical problems
 Often composed of tandem arrays of repetitive sequences
 Transition regions between centromeres or telomeres and the rest of the sequence
(pericentromeric and subtelomeric regions) are HOTSPOTS for the insertion or retention of
repeat sequences and contain blocks of recently duplicated DNA.
 Frequent non-homologous exchanges
 Show radical changes in gene order, harbor novel sequences, undergo extensive genomic
rearrangement, and are preferential sites of reciprocal translocations.
Why are duplications important?
 Engines of gene and genome evolution.
 Whole-genome duplications – cataclysmic genomic events. Require formation of
a tetraploid (4n). Disomic state is gradually reestablished after extensive
rearrangements and deletions.
 Segmental duplications – Duplication of small portions of chromosome in tandem
or transposed to new locations in genome. Often observed as tandem arrays of
gene families.
 Can obscure orthologous relationships and promote nonallelic homologous
recombination.
 As much as 15-50% of genes owe their existence to genomic duplications.
 Segmental duplications tend to create gaps in conserved synteny between species
because of extensive rearrangements, functional diversification, or concerted
evolution of gene family members.
 Human genome is unique in having a large proportion of recent segmental
duplications that are interspersed. These promote further rearrangement through
their own misalignment and subsequent nonallelic homologous recombination.
What effect do Transposable elements have on the Chromosomal landscape?
 Eukaryotic genomes contain substantially differing amounts of repetitive DNA
because of the different propagation and deletion of selfish genetic elements.
 LINEs, SINEs, and long terminal-repeats (LTRs) propagate by reverse
transcription of an RNA intermediate.
 DNA transposons move by a direct “cut and paste” mechanism.
 Rates of retrotransposition and deletion vary a great deal even between closely
related species.
 In some species (cereal grains) it is so extensive as to lead to a doubling in
genome size. These species require counter-balancing mechanisms, such as illegitimate
recombination and unequal homologous recombination, to prevent “genome
obesity”.
 Repeats are not randomly distributed. Eg. L1 repeats prefer gene-poor AT-rich
regions. These biases probably reflect differences in selective constraint and
recombination.
Chromosomal rearrangements and repeats: Cause or consequence?
 Clearly repetitive elements are related to chromosomal rearrangements. But, in
some cases it may be the cause and in others the consequence.
 It has not been clearly determined yet.
Evolution of X & Y Chromosomes
 Arose from an ordinary pair of autosomes
~ 300 million years ago
 X-Y divergence began when X-Y crossing over ceased (↑ evolution - ↓ cross-over)
 Inversions on the Y chromosome may have suppressed crossing over with the X
chromosome. (E.g. the X-degenerate region of the MSY is the product of
region-by-region evolutionary suppression of crossing over in the ancestral autosomes.)
 X-Chromosome:
o High degree of conservation,
o therefore, slow evolution, and
o high degree of synteny between distant relatives
 Y-Chromosome:
o High degree of rapid evolution, and
o therefore unconstrained chromosomal evolution
o Low recombination within chromosome
o High degree of homology-mediated rearrangement
o High degree of gene conversion within duplicated sequences
o Low degree of synteny.
Mechanism of Evolution:
1. Mutation
 Dates from the time when sex was determined by temperature fluctuations
 A mutation that left a promoter constitutively switched on could lead to
sex determination regardless of temperature fluctuations.
 Because such a mutation is passed on through the generations, this mechanism could lead to a
genetic sex.
2. Inversions
 Inversions inhibit recombination
 Allow mutations to accumulate
 Accumulation of non-functional genes
 A series of deletions of these genes (no consequence because non-functional) leads
to an overall reduction in the size of the Y chromosome.
 Referred to as Muller’s Ratchet (the irreversible accumulation of deleterious
mutations and deletions in a region that does not recombine).
 Thus, many believe that the Y chromosome will eventually be lost and male sex may
be determined by a single X, as in Drosophila.
[Figure: Y CHROMOSOME. Labels: PAR (pseudoautosomal region) at each end;
euchromatic region*; heterochromatic region; non-recombining region = MSY.]
*3 Classes in Euchromatic MSY:
1. X-transposed
2. X-degenerate
3. Ampliconic
 MSY is well conserved from mice to humans
 All other regions are not conserved (areas of recombination)
DNA Microarray Technology
 Microarray analysis is an umbrella term that is often used to refer to cDNA
microarrays, oligonucleotide arrays, and SAGE.
 Please refer to Block #1 for details
Analysis of DNA Microarray Data, Statistical Significance
Normalization and Noise:
Normalization
 Some kind of normalization is usually required when comparing more than one
microarray experiment.
 Adjust to account for differences in overall brightness of slides
 Normalize relative to housekeeping genes
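Both normalization strategies listed above can be sketched briefly. This is a minimal illustration, assuming each slide is a mapping from gene name to raw intensity; the intensity values are invented, and ACTB and GAPDH stand in as example housekeeping genes.

```python
from statistics import mean, median


def scale_to_common_median(slides):
    """Global scaling: rescale every slide so all slides share one median
    intensity, correcting for differences in overall brightness."""
    medians = [median(s.values()) for s in slides]
    target = mean(medians)
    return [{gene: value * target / m for gene, value in s.items()}
            for s, m in zip(slides, medians)]


def normalize_to_housekeeping(slide, housekeeping):
    """Express every gene relative to the mean intensity of housekeeping genes."""
    reference = mean(slide[g] for g in housekeeping)
    return {gene: value / reference for gene, value in slide.items()}


# Hypothetical data: slide_b is the same sample scanned twice as bright.
slide_a = {"ACTB": 1000.0, "GAPDH": 800.0, "GENE_X": 200.0}
slide_b = {"ACTB": 2000.0, "GAPDH": 1600.0, "GENE_X": 400.0}

scaled_a, scaled_b = scale_to_common_median([slide_a, slide_b])
print(scaled_a["GENE_X"], scaled_b["GENE_X"])

hk = ["ACTB", "GAPDH"]
print(normalize_to_housekeeping(slide_a, hk)["GENE_X"],
      normalize_to_housekeeping(slide_b, hk)["GENE_X"])
```

After either normalization the two slides report the same value for GENE_X, which is the desired outcome when the only difference between them is overall brightness. Housekeeping normalization additionally assumes the reference genes really are constant across conditions.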
Noise
 Refers to variability and reproducibility of microarray experiments
 Intra- and inter-microarray variations can significantly skew interpretation of data
 Sample collection is very important. If comparing two conditions, you must
control for all variables other than the one you are trying to measure
 Technical noise can result from imperfections in the chip.
 Both biological and technical replicates are required to measure and control these
sources of noise
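A simple way to quantify noise from replicates is the coefficient of variation (standard deviation divided by mean) of one gene's intensity. The sketch below assumes replicate intensities are already normalized; the numbers are invented for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of replicate intensities."""
    return stdev(values) / mean(values)


# Hypothetical intensities for one gene.
technical_replicates = [98.0, 100.0, 102.0]    # same RNA, three arrays
biological_replicates = [70.0, 100.0, 130.0]   # three independent samples

print("technical CV :", round(coefficient_of_variation(technical_replicates), 3))
print("biological CV:", round(coefficient_of_variation(biological_replicates), 3))
```

Comparing the two values indicates whether variation between samples exceeds what the platform alone produces.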
Supervised versus Unsupervised:
Supervised
 Analysis to determine genes that fit a predetermined pattern
 Usually used to find genes with expression levels that are significantly different
between groups of samples, or to find genes that accurately predict a
characteristic of the sample
 Two popular supervised techniques would be nearest-neighbour analysis and
support vector machines.
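The nearest-neighbour idea can be sketched as follows: build an idealized gene that is high in one class and low in the other, then rank real genes by their similarity to it. This is an illustrative sketch only; it uses Pearson correlation as the similarity measure, and the gene names, labels, and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def rank_by_ideal_pattern(expression, labels, high_class):
    """Rank genes by correlation with an idealized gene that is 1 in
    `high_class` samples and 0 in all others (most similar first)."""
    ideal = [1.0 if label == high_class else 0.0 for label in labels]
    scored = [(pearson(values, ideal), gene) for gene, values in expression.items()]
    return sorted(scored, reverse=True)


labels = ["disease1", "disease1", "disease2", "disease2"]
expression = {
    "gene_up_d1": [9.0, 8.0, 1.0, 2.0],
    "gene_flat": [5.0, 5.0, 5.0, 5.0],
    "gene_up_d2": [1.0, 2.0, 9.0, 8.0],
}
for score, gene in rank_by_ideal_pattern(expression, labels, "disease1"):
    print(f"{gene}: {score:.2f}")
```

Genes at the top of the ranking are candidates for markers of the chosen class; genes at the bottom mark the other class.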
Unsupervised
 analysis to characterize the components of a data set without a priori input or
knowledge of a training signal
 Try to find internal structure or relationships in data without trying to predict
some ‘correct answer’.
 Three classes:
1. Feature determination
Look for genes with interesting patterns
Eg. Principal-components analysis
2. Cluster determination
Determine groups of genes with similar expression patterns
eg. Nearest-neighbour clustering, self-organizing maps, k-means clustering,
2d hierarchical clustering
3. Network determination
Determine graphs representing gene-gene or gene-phenotype interactions.
Eg. Boolean networks, Bayesian networks, relevance networks
Distance metrics versus clustering methods
Distance Metrics
 Measures of dissimilarity between expression profiles: the smaller the distance,
the more similar the profiles
Pearson correlation
Absolute Pearson correlation
Uncentred Pearson
Absolute uncentred Pearson
Euclidean distance
Harmonically summed Euclidean distance
Spearman's rank correlation
Kendall's τ (tau)
City-block distance
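Several of the metrics listed above are short enough to write out directly. The sketch below covers only a subset (Pearson, absolute Pearson, uncentred Pearson, Euclidean, city-block); the conversion of a correlation r into a distance as 1 - r is a common convention assumed here, not something specified in the notes.

```python
import math


def pearson(x, y):
    """Centred Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def uncentred_pearson(x, y):
    """Like Pearson, but without subtracting the means (cosine similarity)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def pearson_distance(x, y):
    """Assumed convention: distance = 1 - r (0 for identical shape, 2 for opposite)."""
    return 1.0 - pearson(x, y)


def absolute_pearson_distance(x, y):
    """Treats strong negative correlation as similarity: distance = 1 - |r|."""
    return 1.0 - abs(pearson(x, y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


up = [1.0, 2.0, 3.0, 4.0]
up_scaled = [2.0, 4.0, 6.0, 8.0]
down = [4.0, 3.0, 2.0, 1.0]
print(pearson_distance(up, up_scaled), euclidean(up, up_scaled))
print(pearson_distance(up, down), absolute_pearson_distance(up, down))
```

The example shows why the choice matters: `up` and `up_scaled` are identical by correlation but far apart by Euclidean distance, and `up` and `down` are maximally distant by Pearson yet identical by absolute Pearson.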
Clustering methods
 Builds on dissimilarity measures to create groups with similar features
1. Hierarchical clustering
2. Self-organizing maps
3. Relevance networks
4. Principal-components analysis
5. Nearest neighbours
6. Support vector machines
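To show how a clustering method builds on a distance metric, here is a minimal agglomerative (bottom-up) hierarchical clustering sketch. It assumes Euclidean distance and average linkage, which are only one of many possible combinations, and the gene profiles are hypothetical. Real analyses would normally use an established library rather than this quadratic-time loop.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(profiles[a], profiles[b]) for a in cluster_a for b in cluster_b]
    return sum(dists) / len(dists)


def hierarchical_clustering(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


profiles = {
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [1.1, 2.1, 2.9],
    "gene_c": [9.0, 8.0, 7.0],
    "gene_d": [9.2, 7.9, 7.1],
}
print(hierarchical_clustering(profiles, 2))
```

Swapping `euclidean` for a correlation-based distance changes which genes group together, which is why the distance metric and the clustering method are listed as separate choices.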
Figure 3 | Clustering and network-determination methods used in microarray analysis. The choice of
the proper method and the results obtained clearly depend on the starting hypothesis. This figure shows the
results of six analytical methods applied to the same hypothetical data set. a | Hierarchical clustering sorts
all genes (or samples), such that similar genes appear near each other. The length of the branch is inversely
proportional to the degree of similarity. Shades of red indicate increased relative expression; shades of
green indicate decreased relative expression. b | Self-organizing maps find variable-sized clusters of genes
that are similar to each other, given the input number of clusters to find. c | Relevance networks find and
display pairs of genes with strong positive and negative correlations, then construct networks from these
gene pairs; typically, the strength of correlation is proportional to the thickness of the lines between genes,
and red indicates a negative correlation. d | Principal-components analysis is typically used as a
visualization technique, showing the clustering or scatter of genes (or samples) when viewed along two or
three principal components. In the figure, a principal component can be thought of as a ‘meta-biological
sample’, which combines all the biological samples so as to capture the most variation in gene expression. e
| The nearest-neighbour supervised method first involves the construction of hypothetical genes that best fit
the desired patterns (for example, a gene with high expression in disease 1 and low expression in disease 2,
or vice versa). The technique then finds individual genes that are most similar to the hypothetical genes. f |
Instead of restricting to individual genes, support vector machines efficiently try several mathematical
combinations of genes to find the line (or plane) that best separates groups of biological samples. CD3G,
CD3G antigen, γ-polypeptide; CD28, CD28 antigen; IL-24, interleukin-24; PHKB, phosphorylase kinase-β;
PTGS2, prostaglandin-endoperoxide synthase 2; PXF, peroxisomal farnesylated protein; TCF12,
transcription factor 12; STAT1, signal transducer and activator of transcription 1.
Post analysis Challenges
 The rate limiting step is the post-analytical work where you try to determine what
the actual results mean.
 Names and information for ‘genes’ on slide may be unknown or ambiguous
 Probes that were initially thought to be unique might later be found to hit two or
more genes.
 Probes are sometimes designed against chromosomal regions and it is unclear
which gene they correspond to.
 The analysis of a microarray experiment is never complete because we are still
learning about the genes on the array.
Assigned readings:
Ramaswamy et al. 2003. A molecular signature of metastasis in primary solid
tumours.
Stuart et al. 2003. A gene-coexpression network for global discovery of conserved
genetic modules.
Useful reference:
Butte A. 2002. The use and analysis of microarray data.
Nat Rev Drug Discov. 1(12):951-60.
Validation and Limitations of DNA Microarrays
Limitations
 DNA microarray studies are often not “hypothesis driven”, and have been
associated with “fishing”. That said, microarray studies can often generate
hypotheses for follow-up experiments.
 DNA microarray experiments require a large amount of high-quality sample,
generally 50-200 µg
 Will mostly pick up genes that are highly expressed; may miss genes that are
upregulated but fall below the fold-increase cutoff chosen by the
researcher, or that are present at such low copy number that the array does not detect them at all.
 Reproducibility - Need to standardize technology to enable comparison of data
between labs
o Introduction of artifacts is possible at any time during an array experiment
 No consensus about how to interpret gene expression patterns of hypothetical
genes, genes of unknown function, or transcripts identified only by ESTs
 Hard to handle the large amounts of data produced.
Validation
 Experimental Quality Control:
o An up front validation is achieved by including separate regions of each
gene as an individual target on the microarray
o Optimization of each experiment is required
o To eliminate background noise and false positives, perform repeats of
experiments.
 Multiple arrays
 Replicates of each RNA sample
 Replicates of RNA preparation
o During Image Acquisition
 Normalization of the expression of genes of interest to a housekeeping
gene.
 Background subtraction
 Data processing
 Standardization of data (ie. Only keep genes that were expressed at
least 2 fold higher than in control)
 Use of visualization tools
o There is no standard for numerical analysis
 Mostly achieved using high end computational methods
 Clustering
 Pattern identification
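The fold-change standardization step mentioned above (keep only genes expressed at least 2-fold higher than in the control) can be sketched as a simple filter. The gene names and intensities are hypothetical, and skipping genes with a missing or non-positive control value is an assumption made here to avoid division by zero.

```python
def filter_by_fold_change(sample, control, min_fold=2.0):
    """Keep genes whose sample/control ratio is at least min_fold.

    Genes absent from the control, or with a control value <= 0, are
    skipped (an assumption of this sketch).
    """
    kept = {}
    for gene, value in sample.items():
        base = control.get(gene)
        if base is None or base <= 0:
            continue
        fold = value / base
        if fold >= min_fold:
            kept[gene] = fold
    return kept


sample = {"g1": 400.0, "g2": 150.0, "g3": 90.0}
control = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
print(filter_by_fold_change(sample, control))  # {'g1': 4.0}
```

Note that g2, which is upregulated 1.5-fold, is discarded. This is exactly the limitation listed earlier: genes that change by less than the chosen cutoff are missed.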
 2 main validation techniques:
o In Silico
 Compare your array results with the information available in the
literature, or in public and private databases.
 Cross reference microarray results for agreement with known
expression information in literature. This validates general
performance of your microarray system, and provides confidence
in overall data.
 This will become more useful when standardized methods of
reporting results are uniformly applied.
o Laboratory Based
 Use of independent experimental verification of gene expression
levels.
 Typically use the same samples as were used on the array
 Methodology used depends on the experimental question, but often
includes semi-quantitative RT-PCR (reverse transcription PCR),
real-time RT-PCR, Northern blot, ribonuclease protection assay,
in situ hybridization or immunohistochemistry with tissue
microarrays.
 Also need to validate universality of results
o Determine if the experimental profiles are a universal feature of the
biological phenomenon under study.
o Are the data an essential descriptor of biological state?
 Address this via evaluation of a critical gene set in a larger and more
extensive study group (i.e. in silico or in the lab)
Recommended reading:
Chuaqui RF et al. 2002. Post-analysis follow-up and validation of microarray experiments.
Nat Genet. 32 Suppl:509-14.
Statistical Significance versus Biological Significance
Biological significance
"Biological significance" refers to a statistically significant effect that has a noteworthy
impact on health or survival. If an observed effect is small but quite precise (i.e., there is
little uncertainty in the observed value), the effect can be statistically significant even if it
isn't biologically significant. For example, a factor that causes a decrease in blood
pressure of 1 mmHg on the average can be statistically significant if tested in a large
group of people, but an average reduction of 1 mmHg in blood pressure has no practical
clinical implication per se.
Statistical significance
"Statistical significance" means statistical analysis has revealed an effect unlikely to have
occurred by chance alone. The level of significance refers to the probability of seeing a
result at least this extreme if there were no real effect. At the .05 (5%) level such a result
would occur by chance 1 time in 20; at the .01 (1%) level it would occur by chance
only 1 time in 100. Any effect observed in a study or experiment carries with it some
degree of uncertainty, or imprecision, because of randomness and variability in most
biological phenomena. Statistical techniques evaluate an observed effect in view of its
precision to determine with what probability it might have arisen by chance (the level of
significance). Values with a low probability of occurring by chance are called
"statistically significant" and are thought to represent a real effect.
So statistical significance tells you if there is a difference, while biological significance
tells you whether that difference is important.
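The blood-pressure example can be worked through numerically. The sketch below uses a two-sample z-test and assumes, purely for illustration, a standard deviation of 15 mmHg in both groups and equal group sizes; neither value comes from the notes.

```python
import math


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups with a known, shared standard deviation (z-test)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


effect_mmhg = 1.0   # the 1 mmHg average reduction from the text
sd_mmhg = 15.0      # ASSUMED between-person standard deviation

for n in (100, 10_000):
    p = two_sample_z_p_value(effect_mmhg, sd_mmhg, n)
    print(f"n = {n:>6} per group: p = {p:.2g}")
```

With 100 people per group the 1 mmHg effect is nowhere near significant; with 10,000 per group it is highly significant. The effect itself is identical in both cases, and its clinical relevance has not changed, only the precision with which it was measured.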
Gene Networks
 A method to examine cellular processes and functionally characterize genes by
identifying groups of genes that appear to be co-expressed
 Based on the concept that genes that are co-expressed are associated and
contribute together functionally in pathways.
 Rationale: if a gene is linked in a network to many genes that participate in the
same biological process, it is reasonable to hypothesize that it also participates in
the process.
 Problems:
o May miss genes known to be involved in pathways
o No consensus exists as to how to interpret the gene expression patterns of :
 Hypothetical genes
 Genes of unknown function
 Transcripts identified only by ESTs
o Inefficiently samples biological variability of a system
o Statistical significance ≠ biological significance
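The guilt-by-association rationale can be sketched as a two-step procedure: link genes whose expression profiles are strongly correlated, then let an unannotated gene inherit the most common annotation among its neighbours. This is an illustration only; the 0.9 correlation threshold, the gene names, the annotations, and the expression values are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def coexpression_edges(expression, threshold=0.9):
    """Link every pair of genes whose |Pearson r| meets the threshold."""
    edges = set()
    for a, b in combinations(sorted(expression), 2):
        if abs(pearson(expression[a], expression[b])) >= threshold:
            edges.add((a, b))
    return edges


def predict_process(gene, edges, annotations):
    """Most common annotation among the gene's network neighbours, or None."""
    neighbours = [b if a == gene else a for a, b in edges if gene in (a, b)]
    votes = Counter(annotations[n] for n in neighbours if n in annotations)
    return votes.most_common(1)[0][0] if votes else None


expression = {
    "ribo1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ribo2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "unknown": [1.1, 2.0, 3.2, 3.9, 5.1],
    "heat1": [5.0, 1.0, 4.0, 2.0, 3.0],
}
annotations = {"ribo1": "translation", "ribo2": "translation", "heat1": "stress response"}

edges = coexpression_edges(expression)
print(sorted(edges))
print(predict_process("unknown", edges, annotations))
```

The prediction is only a hypothesis. As the problems listed above note, a statistically strong correlation does not by itself establish a shared biological function.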
Identification of Pathways
 Rationale: Under the gene-network approach, proteins located in the same
pathway should be expressed at the same time. However, microarrays pick up so
much expression data that a correlation may be statistically significant without
necessarily being biologically significant.
 Therefore, a better approach would be to look at expression across species.
Comparative Genomics to Functional Phylogenomics
 Because of the above limitations, a better method is to look at gene expression
across species to see which genes are co-expressed.
 Genes that are co-expressed across species are more likely to be functionally
significant.
 Therefore phylogenomics is a method of increasing biological significance by
studying co-expression across species (instead of repeating measurements many times within
one species, which only increases statistical significance).
 Phylogenomics: The study of the evolution of genes and gene families using
DNA sequence information from organisms selected at major branch points along
the phylogenetic continuum.
 Phylogenetic: Of or pertaining to the history of ancestry and descent.
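The cross-species filter described above can be sketched as follows: map genes to ortholog groups, find co-expressed pairs within each species, and keep only the pairs that recur in several species. This is a loose illustration of the idea, not the published method of any assigned reading; the ortholog group identifiers (OG1, OG2, ...) and the `min_species` cutoff are hypothetical.

```python
from collections import Counter


def conserved_pairs(edges_by_species, min_species=2):
    """Return co-expressed ortholog-group pairs seen in at least min_species species.

    edges_by_species maps a species name to an iterable of (group, group)
    pairs that are co-expressed in that species. Pair order is ignored.
    """
    counts = Counter()
    for edges in edges_by_species.values():
        # Normalize pair order and count each pair once per species.
        for pair in {tuple(sorted(p)) for p in edges}:
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_species}


edges_by_species = {
    "human": [("OG1", "OG2"), ("OG1", "OG3")],
    "fly": [("OG2", "OG1")],
    "yeast": [("OG1", "OG2"), ("OG4", "OG5")],
}
print(conserved_pairs(edges_by_species, min_species=3))  # {('OG1', 'OG2')}
```

Pairs that are co-expressed in only one species, such as OG1 and OG3 here, are dropped as more likely to be chance correlations, while the pair conserved in all three species is retained as a candidate functional link.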