MEDG520 Block 6 Genomics Concepts: Genomics vs. Genetics (vs. Proteomics) Sex Determination in Humans Assigned reading: Skaletsky et al. 2003. Chromosome Sequencing Radiation hybrid mapping. Sequence Mapping o Introduction to Sequence Mapping (Griffth’s An Introduction to Genetic Analysis) o Chromosome-specific libraries. o Ordering by FISH. o Ordering by clone fingerprints. o Ordering by sequence-tagged sites. o An example: cloning and mapping the human Y chromosome. Y Haplotype Markers Population Genetics o Founder Effect o Genetic Bottleneck o Factors affecting variation o Genetic Drift Gene Dispersion Genetic – Geographical Association Gene Conversion o Mechanisms of Gene Conversion Homologous Recombination Gene Duplication o Mechanisms of Gene Duplication Comparative Genomics Assigned readings: Skaletsky et al. 2003. Evolution of Autosomal Chromosomes Assigned Reading: Eichler and Sankoff. 2003 Explain Synteny: Fragile versus random breakage models. What is unusual about Centromeric and Telomeric regions? Why are duplications important? What effect do Transposable elements have on the Chromosomal landscape? Chromosomal rearrangements and repeats: Cause or consequence? Evolution of X & Y Chromosomes o Mechanisms of Evolution: DNA Microarray Technology o Analysis of DNA Microarray Data, Statistical Significance o Normalization and Noise: o Supervised versus Unsupervised: o Distance metrics versus clustering methods o Post analysis Challenges o Validation and Limitations of DNA Microarrays Statistical Significance versus Biological Significance Gene Networks Identification of Pathways Comparative Genomics to Functional Phylogenomics Genomics vs. Genetics (vs. Proteomics) Genomics Whole genome approach versus a few genes Operationally defined as investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion. Genetics looks at single genes, one at a time, as a snapshot. 
Genomics tries to look at all the genes as a dynamic system, over time, and to determine how they interact and influence biological pathways and physiology in a much more global sense.
Examples of genomics:
o Systems biology (rather than single genes in a pathway)
o Gene modifiers
o Network biology (how pathways interact with each other)
Proteomics
Proteome: The complement of proteins expressed by an organism, tissue or cell type.
The study of the full set of proteins encoded by a genome.
The characterization of patterns of gene expression at the protein level, or the link between proteins and genomes.
Proteomics encompasses many different approaches to protein study, from bioinformatics of the protein content of genomes to large-scale direct analysis of complicated protein mixtures, and the definition of a protein's properties, interactions and modifications.
Look at how proteins interact (protein-protein interactions).
Look at protein expression (e.g. Shiio et al. 2002 paper: used microarray technology to measure protein expression).
Sex Determination in Humans
The Y chromosome carries the testis determining factor (TDF), later discovered to be SRY.
The gonadal ridge is bipotential, that is, it may develop along either the male or the female pathway.
TDF acts on the bipotential gonadal ridge to initiate the male pathway.
The testes produce testosterone, which ultimately leads to a male human.
Pathway diagram: Bipotential gonadal ridge, with SRY -> testis (produces testosterone) -> male; without SRY -> female.
In 1990 TDF was identified as SRY, based on XX male patients who carried a translocated piece of the Y chromosome containing the SRY gene.
A transgenic mouse line carrying SRY was subsequently made and provided the final proof.
o Random insertion
o Overexpressed SRY on a female background (XXtg(sry))
o 3/11 were male, then 2 died
o The last one left was called Randy, the sex-reversed male
Take home message:
1. Female is the default pathway.
2. SRY is needed, as it is the male determining factor.
Assigned reading: Skaletsky et al. 2003.
The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes Suggest renaming of NRY (non-recombining region of Y chromosome) to MSY (male specific region). A male-specific region flanked by pseudoautosomal regions where X-Y recombination is frequent in male meiosis. Why has it taken so long to construct an accurate, high-res physical map of the MSY and how was it overcome? o lengthy intrachromosomal repetitive sequences referred to as amplicons. o Overcome by identifying minute variations between amplicon copies. Use these sequence family variants as markers to be ordered for mapping. o Required complete and accurate sequencing and comparison of amplicon variants. o This kind of physical mapping is called sequence mapping. See below. What did the authors do? o mapped and sequenced a tiling path of 220 BAC clones, each containing a portion of the MSY from the same individual. Only one individual was sequenced because they needed to make sure the small variations between amplicons were not polymorphisms in the population. o Obtained finished sequence (err=1 in 10000 bases) for all MSY euchromatin except for two small problematic regions and much of the heterochromatin. What did they find? o MSY includes 156 transcriptional units, half of which likely encode protein. o 60 of the 78 protein coding units fall into 9 MSY-specific gene families (with sequence homology of 98% or greater within the families) o The remaining 18 are single copy genes. o Of the 27 genes or gene families, about half are expressed ubiquitously and the other half only expressed in the testis. o Three distinct classes of sequence make up the euchromatic MSY: 1. X-transposed – No X-Y crossing over. Lowest gene-density. Highest LINE and SINE content. 99% identical to sequences in Xq21. Result of a massive X-Y transposition. 2. X-degenerate – dotted with single-copy gene or pseudogene homologues of 27 different X-linked genes (60-96% identity). 
Surviving relics of the ancient autosomes from which X and Y evolved. About half of the genes seem to be transcribed and functional, producing protein isoforms of their counterparts on X. All the ubiquitously expressed Y genes are located here.
3. Ampliconic – very long repeats (tens or hundreds of kb) with very high similarity (as much as 99.9%). Contain the most genes, including the 9 MSY-specific gene families named above (with copy numbers ranging from 2 to 35). Almost exclusively expressed in the testes. Have the lowest LINE and SINE content.
Massive palindromes and inverted repeats
o Found 8 huge palindromes in the ampliconic regions. 6 of these contain recognizable coding genes, with the gene existing in identical (or nearly identical) form on each arm of the palindrome.
o Some of the gene families exist solely on palindromes.
o Also characterized 5 sets of more widely spaced inverted repeats.
Tandem arrays
o e.g. the NORF (no long open reading frame) array consists of ~2.48 kb repeats containing a great variety of spliced but apparently non-coding transcriptional units.
Definitions:
Pseudogenes - Genes bearing close resemblance to known genes at different loci, but rendered non-functional by additions or deletions in structure that prevent normal transcription or translation. When lacking introns and containing a poly-A segment near the downstream end (as a result of reverse copying from processed nuclear RNA into double-stranded DNA), they are called processed pseudogenes.
Palindrome - A region of DNA which reads the same in the forward and backward directions. An example is 5'-CATG-3', which when both strands are considered is:
5'-CATG-3'
3'-GTAC-5'
If the top strand is read in the forward 5' to 3' direction, it reads 'CATG'. If the bottom strand is read in the 5' to 3' direction, it also reads 'CATG', and so the sequence is a palindrome.
- A nucleotide sequence on a DNA molecule in which the same sequence is found on each strand, but in the opposite direction.
- A self-complementary nucleic acid sequence, that is, a sequence identical to its complementary strand; perfect palindromes (e.g. GAATTC) frequently occur at sites of recognition for restriction enzymes; less perfect palindromes (e.g. TACCTCTGGCGTGATA) frequently occur in binding sites for other proteins, such as repressors.
Chromosome Sequencing
Physical maps can be divided into three general types:
1. chromosomal or cytogenetic maps,
2. radiation hybrid (RH) maps,
3. sequence maps.
The different types of maps vary in their degree of resolution, that is, the ability to measure the separation of elements that are close together. The higher the resolution, the better the picture.
One goal of physical mapping is to identify a set of overlapping cloned fragments that together encompass an entire chromosome or an entire genome. The resulting physical map is useful in three ways. First, the genetic markers carried on the clones can be ordered and hence contribute to the overall genome mapping process. Second, when the contiguous clones have been obtained, they represent an ordered library of DNA sequences that can be exploited for future genetic analysis. Third, these clones form the raw material that will be sequenced in large-scale genome projects.
Radiation hybrid mapping.
This technique was designed to produce a higher-resolution map of molecular markers along a chromosome. The procedure is to X-ray treat human cells to fragment the chromosomes and then fuse the irradiated cells with rodent cells to form a panel of different hybrids. In this case, the hybrids have an assortment of fragments of human chromosomes, as diagrammed in Figure 14-11. Most of the fragments are seen to be embedded in the rodent chromosomes, but truncated human chromosomes also can be found. A standard panel in the range of 100 to 200 radiation hybrids is quite straightforward to obtain.
Such a panel is sufficient to obtain a high-resolution cR3000 map of the human genome, which would have 10-fold greater resolution than the current centimorgan genetic map. One downside of the technique is that it is limited to those markers for which human-rodent differences are available.
Figure 14-15. Using sequence-tagged sites (STSs) to order overlapping clones (YACs, in this example) into a contig. Five different YACs are tested to determine which STSs they contain (top), and these data are used to assemble a physical map (bottom).
Sequence Mapping
Sequence-tagged site (STS) mapping is another physical mapping technique. An STS is a short DNA sequence that has been shown to be unique. To qualify as an STS, the exact location and order of the bases of the sequence must be known, and the sequence may occur only once in the chromosome being studied, or only once in the genome as a whole if the DNA fragment set covers the entire genome.
Common Sources of STSs
Expressed sequence tags (ESTs) are short sequences obtained by analysis of complementary DNA (cDNA) clones. Complementary DNA is prepared by converting mRNA into double-stranded DNA and is thought to represent the sequences of the genes being expressed. An EST can be used as an STS if it comes from a unique gene and not from a member of a gene family in which all of the genes have the same, or similar, sequences.
Simple sequence length polymorphisms (SSLPs) are arrays of repeat sequences that display length variations. SSLPs that are polymorphic and have already been mapped by linkage analysis are particularly valuable because they provide a connection between genetic and physical maps.
Random genomic sequences are obtained by sequencing random pieces of cloned genomic DNA or by examining sequences already deposited in a database.
To map a set of STSs, a collection of overlapping DNA fragments from a chromosome is digested into smaller fragments using restriction enzymes, agents that cut DNA molecules at defined target points. The data from which the map will be derived are then obtained by noting which fragments contain which STSs.
To accomplish this, scientists copy the DNA fragments using a process known as "molecular cloning". Cloning involves the use of recombinant DNA technology to copy DNA fragments inside a foreign host. First, the fragments are united with a carrier, also called a vector. After introduction into a suitable host, the DNA fragments can then be reproduced along with the host cell DNA, providing unlimited material for experimental study. An unordered set of cloned DNA fragments is called a library.
Next, the clones, or copies, are assembled in the order in which they would be found in the original chromosome by determining which clones contain overlapping DNA fragments. This assembly of overlapping clones is called a clone contig. Once the order of the clones in a chromosome is known, the clones are placed in frozen storage, and the information about the order of the clones is stored in a computer, providing a valuable resource that may be used for further studies. These data are then used as the base material for generating a lengthy, continuous DNA sequence, and the STSs serve to anchor the sequence onto a physical map.
Introduction to Sequence Mapping (Griffiths' An Introduction to Genetic Analysis)
In the preparation of physical maps of genomes, vectors that can carry very large inserts are naturally the most useful. Cosmids, YACs (yeast artificial chromosomes), BACs (bacterial artificial chromosomes), and PACs (phage P1-based artificial chromosomes) have been the main types. As cloning vectors, BACs can carry inserts of foreign DNA as large as 300 kb, although the average is about 100 kb.
PACs are produced by a type of engineering similar to that of phage P1; they carry inserts comparable to those of BACs. Although the maximum insert sizes of BACs and PACs are not as large as those of YACs, the former types have several advantages over YACs. First, they can be amplified in bacteria and isolated and manipulated simply with basic bacterial plasmid technology. Second, BACs and PACs form fewer hybrid inserts than YACs do. Hybrid inserts are composed of several different fragments; their presence can thwart attempts to order the clones. Cloning a whole genome begins by amassing a large number of randomly cloned inserts. The contents of these clones must be characterized in some way, and overlaps must be determined. A set of overlapping clones is called a contig. In the early phases of a genome project, contigs are numerous and represent cloned "islands" of the genome. But, as more and more clones are characterized, contigs enlarge and merge into one another, and eventually the project should end up with a set of contigs that equals the number of chromosomes. Chromosome-specific libraries. If a library of clones is prepared from total genomic DNA, then contig development is relatively slow. However, if a specific chromosome can be used to develop the library of clones, contigs emerge more rapidly. PFGE can be used to isolate individual chromosomes (if they are small) or chromosome fragments cut with "long-cutter" enzymes such as NotI. Flow sorting is another option for preparing DNA of a specific chromosome. Chromosomes (such as human chromosomes) can be flow-sorted by fluorescence-activated chromosome sorting (FACS; Figure 14-13). In this procedure, metaphase chromosomes are stained with two dyes, one of which binds to AT-rich regions and the other to GC-rich regions. Cells are disrupted to liberate whole chromosomes into liquid suspension. 
This suspension is converted into a spray in which the concentration of chromosomes is such that each spray droplet contains one chromosome. The spray passes through laser beams tuned to excite the fluorescence. Each chromosome produces its own characteristic fluorescence signal, which is recognized electronically, and two deflector plates direct the droplets containing the specific chromosome needed into a collection tube. MESSAGE Genomic cloning proceeds by assembling clones into overlapping groups called contigs. As more data accumulate, the contigs become equivalent to whole chromosomes. Several different techniques are used to order genomic clones into contigs. We shall consider some of the main ones. Ordering by FISH. If good chromosomal landmarks are known, FISH analysis can be used to locate the approximate positions of the large inserts. Figure 14-14 shows results of a FISH analysis that generates a rough ordering of BACs and PACs in human chromosomes. Ordering by clone fingerprints. The genomic insert carried by a vector has its own unique sequence, which can be used to generate a DNA fingerprint. For example, a multiple restriction-enzyme digestion can generate a set of bands whose number and positions are a unique "fingerprint" of that clone. The different bands generated by separate clones can be aligned either visually or by using a computer program to determine if there is any overlap between the inserted DNAs. In this way, the contig can be built up. Ordering by sequence-tagged sites. Unique short sequences of large cloned inserts can be used as tags to align the various clones into contigs. For example, if clone A has tags 1 and 2 and clone B has tags 2 and 3, clones A and B must overlap in the region of tag 2. The practical procedure is to amass a large set of random clones with small genomic inserts (say, in phage) and sequence short regions of each. 
From these sequences, pairs of PCR primers are designed that will amplify the short specific sequence of DNA flanked by the primers. These short DNA sequences are known as sequence-tagged sites (STSs). Even though initially the location of these STSs in the genome is not known, a panel of many STSs can be used to characterize clones with large genomic inserts (such as YAC clones). The clones that are shown to have specific STSs in common must have overlapping inserts and therefore can be aligned into contigs. An example of this process is shown in Figure 14-15. Short stretches of sequence are sometimes obtained from cDNA clones. These stretches are known as expressed sequence tags (ESTs). ESTs are obtained by sequencing into the cDNA insert by using a primer based on the vector sequence. They can be used to align the cDNAs on the contig, thus anchoring the gene map to the physical map. Further, if part of the open reading frame (ORF) of the transcript is contained within the EST, the "virtual" translation of the ORF can provide a "sneak preview" of the function of the protein encoded by the mRNA from which the cDNA was derived. Furthermore, the DNA of the contigs has been arranged on nitrocellulose filters in ordered arrays; so, to find out where a specific piece of DNA of interest lies in the genome, that DNA is used as a probe on the contig filters, and a positive hybridization signal announces the precise location of the DNA (Figure 14-16). An example: cloning and mapping the human Y chromosome. Several of the smaller human chromosomes have been fully cloned as overlapping sets of YAC clones (contigs). We shall examine the cloning of the Y chromosome as an example because it illustrates several of the techniques of physical mapping. The STS map of the Y chromosome was in fact obtained by two different methods YAC alignment and deletion analysis. YAC alignment. Flow sorting yielded a sample of Y chromosomes, from which clones were made. 
From clones that did not contain repetitive DNA, STS primers were designed. In all, 160 primer pairs were made. A Y chromosome YAC library of 10,368 clones was obtained in which the average insert size was 650 kb. From these numbers, each point on the Y chromosome was estimated to have been sampled an average of four times. The YAC clones were divided into 18 pools of 576 YACs, and the pools were screened with the STS primers. Subdivision of positive pools led rapidly to the assignment of a particular STS to specific YACs. The total STS content of each YAC was assessed, and overlaps between the YACs were determined in the same way as that shown in the generalized example in Figure 14-15.
Deletion analysis. Various types of Y chromosome deletions occur naturally. For example, some XX males contain truncated fragments of the Y, whereas some XY females have deletions of the region containing the maleness (testis-determining) gene (see Chapters 2 and 23). These Y deletions were maintained in cell culture and formed the basis for aligning the Y chromosome STSs. Each deletion was tested for STS content. Because by nature the deletions were nested sets, the STS content could be used not only to develop an STS map, but also to map the coverage of the deletions. The principle is illustrated in Figure 14-17. The STS maps produced by YAC alignment and by deletion analysis were identical.
MESSAGE
Clones can be arranged into contigs by matching DNA fingerprints, by matching short sequences within cloned segments, and by analyzing deletions.
(Source: Griffiths' An Introduction to Genetic Analysis)
Example: Skaletsky et al. (2003) sequenced the MSY region of the Y chromosome. Because previous efforts to sequence its repetitive regions had failed, they identified minute variations between amplicon copies and used these variations as markers to sequence-map the tiling path of BAC clones covering the euchromatin of the MSY.
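The STS-content principle described above (clones that share an STS must overlap) can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the Y chromosome project: the clone and STS names and the helper functions `overlap_graph` and `order_contig` are hypothetical, and the code assumes error-free data in which each clone overlaps only its immediate neighbours, so that the clones form one simple chain.

```python
from itertools import combinations


def overlap_graph(sts_content):
    """Link every pair of clones that share at least one STS."""
    graph = {clone: set() for clone in sts_content}
    for a, b in combinations(sts_content, 2):
        if sts_content[a] & sts_content[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def order_contig(sts_content):
    """Return the clones in chain order, assuming a single linear contig."""
    graph = overlap_graph(sts_content)
    # The two clones at the ends of the contig have exactly one neighbour.
    ends = sorted(c for c, nbrs in graph.items() if len(nbrs) == 1)
    if len(ends) != 2:
        raise ValueError("clones do not form one simple chain")
    order, seen = [ends[0]], {ends[0]}
    while len(order) < len(graph):
        candidates = graph[order[-1]] - seen
        if len(candidates) != 1:
            raise ValueError("clones do not form one simple chain")
        nxt = candidates.pop()
        order.append(nxt)
        seen.add(nxt)
    return order


# Hypothetical STS content of three clones (made-up data).
clones = {
    "clone_B": {"sts2", "sts3"},
    "clone_A": {"sts1", "sts2"},
    "clone_C": {"sts3", "sts4"},
}
print(order_contig(clones))
```

Real STS-content data are messier: large clones often overlap several neighbours, and chimeric inserts or false positives have to be handled, which this sketch deliberately ignores.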
Y Haplotype Markers
A haplotype is a set of closely linked genetic markers present on one chromosome which tend to be inherited together.
The Y haplotype is the easiest haplotype to determine: the Y chromosome is present in a single copy, so you don't need the parents to work it out.
Y haplotypes are usually larger than other haplotypes because recombination is reduced on the Y chromosome, which has reduced homology with the X chromosome.
Example: Zerjal et al. (2003), The Genetic Legacy of the Mongols:
o Used Y haplotype markers to identify a Y-chromosomal lineage found in 16 populations across Asia with a frequency of ~8%.
o The pattern of the lineage suggested it originated in Mongolia ~1000 years ago.
o The lineage spread via selection, specifically social selection.
o The lineage is carried by male-line descendants of Genghis Khan.
Population Genetics
Founder Effect
o One form of genetic drift occurs when a small group breaks off from a larger population to found a new colony.
o This "acute drift," called the founder effect, results from a single generation of sampling, followed by several generations during which the population remains small.
o One result of random sampling is that most new mutations, even if they are not selected against, never succeed in entering the population.
o The founder effect is probably responsible for the virtually complete lack of blood group B in Native Americans, whose ancestors arrived in very small numbers across the Bering Strait at the end of the last Ice Age, about 20,000 years ago. (Griffiths, Modern Genetic Analysis)
Genetic Bottleneck
o A brief reduction in the size of a population, which usually leads to random genetic drift.
Factors affecting variation
o Increasing variation: mutation, recombination, migration
o Decreasing variation: genetic drift, natural selection (selection can also maintain allele frequencies, e.g. heterozygote advantage: sickle cell anemia heterozygotes are protected from malaria)
Genetic Drift
o Random effects on the gene pool
o E.g.
2 alleles:         A1     A2
Freq:              0.5    0.5
Next generation:   0.49   0.51
Next generation:   0.48   0.52
o Genetic drift has a larger effect in a smaller population.
o Genetic drift takes longer to change allele frequencies in a larger population.
Gene Dispersion
Refers to the spreading of different genotypes (Aa Ba => ABAa).
Occurs through:
1. Migration
2. Recombination
Example: Genghis Khan (illustrates an interesting cultural form of gene dispersion)
Genetic – Geographical Association
Refers to a situation where a population moves or becomes isolated. In this case, there is a lack of random assortment and random mating within that gene pool (the gene pool being all of the alleles available among the reproductive members of a population from which gametes can be drawn).
Isolation may come about in 2 ways:
1. Geographically
2. Culturally
An example of cultural isolation is the Ashkenazi Jewish population, which tends to marry within the community and is thus isolated from other populations.
Gene Conversion
This is defined as the non-reciprocal transfer of genetic information between two genes with a high degree of homology.
A meiotic process of directed change in which one allele directs the conversion of a partner allele to its own form. In asci of Ascomycete fungi a 4:4 ratio of alleles is expected after meiosis, yet 6:2 and 5:3 ratios are sometimes observed. A model of recombination, produced by Holliday, suggests that gene conversion may be explained by repair of heteroduplex DNA.
A type of nonreciprocal recombination event in which a recipient strand of DNA receives information from another strand having an allelic difference. The recipient strand has its original allele "converted" to the new allele as a consequence of the event. Gene conversion can sometimes involve transfer between repeated sequences on the same chromosome.
Figure 9.10. Gene conversion involves a nonreciprocal sequence exchange between allelic or nonallelic genes. (A) Interallelic gene conversion.
Note the nonreciprocal nature of the sequence exchange - the donor sequence is not altered but the acceptor sequence is altered by incorporating sequence copied from the donor sequence. (B) Interlocus gene conversion. This is facilitated by a high degree of sequence homology between nonallelic sequences, as in the case of tandem repeats. (C) Mismatch repair of a heteroduplex. This is one of several possible models to explain gene conversion. The model envisages invasion by one strand of the donor sequence (-) to form a heteroduplex with the complementary (+) strand of the acceptor sequence, thereby displacing the other strand of the acceptor. Mismatch repair enzymes recognize the mispaired bases in the heteroduplex and 'correct' the mismatches so that the (+) acceptor sequence is 'converted' to be perfectly complementary in sequence to the (-) donor strand. Subsequent replication of the (-) acceptor strand and sealing of nicks results in completion of the conversion. (Source: Strachan and Read, Human Molecular Genetics 2)
Homologous sequences.
Orthologs and paralogs are two types of homologous sequences. Orthology describes genes in different species that derive from a common ancestor. Orthologous genes may or may not have the same function. Paralogy describes homologous genes within a single species that diverged by gene duplication.
Mechanisms of Gene Conversion (See Fig. 9.10 above for illustration)
There are several different hypotheses about the mechanisms of gene conversion.
o Homology-directed double-strand break repair occurs between homologous chromosomes or sister chromatids.
o When the damaged DNA and the template DNA are slightly different, mismatches arise; these are then corrected by base excision repair or nucleotide excision repair.
Thus, double-strand breaks are repaired by homologous recombination, in which the broken chromosome is patched up by copying information from a homologous chromosome or sister chromatid.
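The difference between non-reciprocal gene conversion and reciprocal crossover can be made concrete with a toy model in which each DNA molecule is a string. This is a sketch for illustration only: the function names `gene_conversion` and `crossover`, the sequences and the coordinates are all hypothetical, and no molecular mechanism (heteroduplex formation, mismatch repair) is modelled.

```python
def gene_conversion(donor, acceptor, start, end):
    """Copy donor[start:end] into the acceptor; the donor is left unchanged."""
    if len(donor) != len(acceptor):
        raise ValueError("sequences must be aligned and of equal length")
    converted = acceptor[:start] + donor[start:end] + acceptor[end:]
    return donor, converted


def crossover(seq_a, seq_b, point):
    """Reciprocal exchange: both molecules swap everything after the breakpoint."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return seq_a[:point] + seq_b[point:], seq_b[:point] + seq_a[point:]


# Upper case marks one homolog, lower case the other (made-up sequences).
donor, acceptor = "ABCDEF", "abcdef"

print(gene_conversion(donor, acceptor, 2, 4))  # ('ABCDEF', 'abCDef')
print(crossover(donor, acceptor, 3))           # ('ABCdef', 'abcDEF')
```

After conversion only the acceptor has changed, whereas after crossover both molecules carry exchanged material.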
The following figure illustrates the difference between gene conversion and DNA crossover. Comparison between gene conversion and DNA crossover. (a) Two DNA molecules. (b) Gene conversion - the red DNA donates part of its genetic information (e-e' region) to the blue DNA. (c) DNA crossover - the two DNAs exchange part of their genetic information (f-f' and F-F'). An origin of gene conversion. (a) Heteroduplexes formed by the resolution of Holliday structure or by other mechanisms. (b) The blue DNA uses the invaded segment (e') as template to "correct" the mismatch, resulting in gene conversion. (c) Both DNA molecules use their original sequences as template to correct the mismatch. Gene conversion does not occur. The widely accepted model for genetic recombination was first proposed by Robin Holliday in 1964. It involves several steps as illustrated in the following figure. Homologous Recombination Homologous recombination occurs between two homologous DNA molecules. It is also called DNA crossover. During meiosis, two homologous pairs of sister chromatids align side by side. The DNA crossover is very likely to occur. It could be as often as several times per meiosis. DNA crossover. (a) Two homologous pairs of sister chromatids align side by side. (b) The two homologs are connected at a certain point called chiasma. (c) The two homologs exchange the DNA segment from the chiasma to the end of chromosomes. The Holliday model of DNA crossover (genetic recombination). (a) Two homologous DNA molecules line up (e.g., two nonsister chromatids line up during meiosis). (b) Cuts in one strand of both DNAs. (c) The cut strands cross and join homologous strands, forming the Holliday structure (or Holliday junction). (d) Heteroduplex region is formed by branch migration. (e) Resolution of the Holliday structure. Figure 8-D-2e is a different view of the Holliday junction than Figure 8-D-2d. DNA strands may be cut along either the vertical line or horizontal line. 
(f) The vertical cut will result in crossover between f-f' and F-F' regions. The heteroduplex region will eventually be corrected by mismatch repair. (g) The horizontal cut does not lead to crossover after mismatch repair. However, it could cause gene conversion. Gene Duplication The presence of an extra segment of DNA, resulting in redundant copies of a portion of a gene, an entire gene, or a series of genes, usually caused by unequal crossing-over during gene replication when gametes are formed in meiosis. Mechanisms of Gene Duplication Due to significant sequence homology, the repetitive sequence region of a chromatid may not line up exactly with its corresponding region in a homologous chromatid or identical sister chromatid. As a result, different number of repeat units may be generated during meiosis. This is thought to be the major mechanism underlying VNTR (variable number of tandem repeats ). Unequal crossover and sister chromatid exchange. (a) Two pairs of sister chromatids line up during meiosis. A repetitive region of one chromatid (the third one) does not line up exactly with its corresponding region in other chromatids. (b) Strand breaks on nonsister chromatids (along line A) will result in unequal crossover, producing different number of repeat units in these chromatids. (c) Strand breaks on sister chromatids (along line B) also produce different repeats. In this case, it is called sister chromatid exchange. The detailed mechanism of DNA crossover in (b) and (c) may be explained by the Holliday model. Comparative Genomics The study of comparing complete genome sequences, to understand general principles of genome structure and function The study of human genetics by comparisons with model organisms such as mice, the fruit fly, and the bacterium E. coli. A comprehensive view of large-scale changes in synteny, gene order, and regions of nonconservation while simultaneously affording exquisite molecular resolution at the level of the nucleotide. 
Example: Comparative analysis of the Y chromosome using human and ape genomes (see Skaletsky et al. (below) for details).
Assigned readings: Skaletsky et al. 2003. Abundant gene conversion between arms of palindromes in human and ape Y chromosomes.
What did they do?
Looked at human MSY palindromes. These are very large palindromes with very high intra-palindromic sequence identity (99.87%). This high degree of identity suggests a recent gene duplication (gene amplification?). However, comparative analysis with other primates finds very similar genes and indicates that the palindromes might predate the divergence of the human and chimpanzee lineages. They suggest gene conversion as an alternative explanation for the high degree of similarity between the palindrome arms.
How did they test this hypothesis?
Looked for MSY palindromes in the apes most closely related to humans, i.e. chimpanzees, bonobos, and gorillas, to infer whether the palindromes were present in the common ancestor.
Used PCR to amplify the inner and outer boundaries of all 8 palindromes for all species.
Identified (by sequence homology) and sequenced chimp BACs corresponding to 4 of the 8 palindromes.
Studied a C/T SNP using 171 unrelated men that are known to represent 42 distinct branches of a robust tree of human Y chromosome genealogy. See figure in paper.
What did they find?
Most palindromes found in the human MSY were present in the human-chimpanzee common ancestor. Inner boundaries are more conserved than outer boundaries.
Sequencing of the chimp BAC palindromes revealed only 1.44% sequence divergence. This probably just represents the accumulation of neutral mutations in the human and chimp lineages after separation.
Found the same extremely high sequence identity between the two arms of a palindrome in chimps as in humans.
Y chromosome genealogy data show predominantly C on both arms (C/C). A C/T genotype recently arose but was subsequently and rapidly converted to T/T or back to C/C by gene conversion.
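Figures such as the 99.87% arm-to-arm identity and the 1.44% divergence quoted above are, in essence, the fraction of aligned positions that match or differ. A minimal sketch of that arithmetic is given below; it assumes two already-aligned sequences of equal length with no gaps, the function name `percent_identity` and the example sequences are hypothetical, and the paper's own alignment and counting procedure may well differ.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions that match in two gap-free aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Made-up 10-base arms differing at a single position.
arm_1 = "ACGTACGTAC"
arm_2 = "ACGTACGTAT"

identity = percent_identity(arm_1, arm_2)
print(identity)          # 90.0
print(100.0 - identity)  # 10.0, the corresponding percent divergence
```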
Conclusions
MSY palindromes predate the separation of the human and chimp lineages.
Paired arms of palindromes (in humans and chimps) evolved in concert. Gene conversion between the two arms removes polymorphisms as they arise.
The high degree of similarity between arms indicates a steady-state balance between new mutations and the gene-conversion events that erase them.
They calculate that, on average, 600 duplicated nucleotides have undergone arm-to-arm gene conversion for every son born in recent evolutionary history.
The MSY (male-specific region) used to be known as the NRY (non-recombining region), but in fact it recombines within itself at a very high rate.
Conservation between humans and chimps in the palindromic regions is extremely high even for non-functional sequences (eg. Alu repeats). They suggest there might be a slight bias of gene conversion towards the original sequence.
The results cast doubt on molecular clock methods that assign recent dates to gene duplications because of high similarity. If these duplications are also present in ancient ancestors, they could be much older, and the similarity between duplicates may be the result of a gene conversion mechanism as proposed here.
They suggest that this mechanism may have evolved to combat the evolutionary decay observed elsewhere in the Y chromosome, and that it explains why most intact testis-specific genes are located in the palindromic arms: a safe place to keep important genes.

Evolution of Autosomal Chromosomes
Whole Genome Duplication:
o Ex/Proof: a single Drosophila gene has 4 human paralogues
o 4N -- (mutate) --> 2N -- (mutate) --> 2N (both copies have mutated so much that they become different from the original)
Subgenomic Duplication
o Methods:
1. Non-reciprocal homologous recombination: ↑ variation
2. Non-reciprocal sister chromatid exchange: no new variation, because it is an exact switch
3. Transposition: a gene jumps out of its location and moves somewhere else.
Retrotransposons: an RNA copy is made from the DNA, and new DNA is then synthesized from that RNA (reverse transcription).
Duplication is an essential evolutionary mechanism: one copy is free to change because the other copy provides a functional back-up.

[Figure: two homologous chromosomes. 1: Non-reciprocal homologous recombination. 2: Non-reciprocal sister chromatid exchange.]

Assigned Reading: Eichler and Sankoff. 2003. Structural dynamics of eukaryotic chromosome evolution.

What points are they trying to make?
Chromosomes evolve by modification, acquisition, deletion, and/or rearrangement of genetic material.

Explain Synteny: Fragile versus random breakage models.
Chromosome structure in 2 eukaryotic genomes with a common ancestor is altered by intrachromosomal inversions or reciprocal interchromosomal translocations.
Conserved synteny is observed when a number of sequence markers map to a single chromosome in each genome, irrespective of order. If the markers also have the same order, we call it a homologous segment.
The random breakage model hypothesizes that breakpoints between homologous segments are distributed uniformly at random. When chromosomes are considered on a large, low-resolution scale, this is generally true.
The fragile breakage model arose as genomes were completely sequenced and a much higher resolution analysis was conducted. When micro-rearrangements are considered (small inversions, deletions, or transpositions within otherwise conserved segments), many more breakpoints are found, and their density varies greatly between chromosomal regions.

What is unusual about Centromeric and Telomeric regions?
Unusually dynamic
Sequencing and annotation remain spotty because of technical problems
Often composed of tandem arrays of repetitive sequences
Transition regions between centromeres or telomeres and the rest of the sequence (pericentromeric and subtelomeric regions) are HOTSPOTS for insertion or retention of repeat sequences and contain blocks of recently duplicated DNA.
Frequent non-homologous exchanges
Show radical changes in gene order, harbor novel sequences, undergo extensive genomic rearrangement, and are preferential sites of reciprocal translocations.

Why are duplications important?
Engines of gene and genome evolution.
Whole-genome duplications: cataclysmic genomic events. Require formation of a tetraploid (4n). The disomic state is gradually re-established after extensive rearrangements and deletions.
Segmental duplications: duplication of small portions of a chromosome, either in tandem or transposed to new locations in the genome. Often observed as tandem arrays of gene families. Can obscure orthologous relationships and promote nonallelic homologous recombination.
As many as 15-50% of genes owe their existence to genomic duplications.
Segmental duplications tend to create gaps in conserved synteny between species because of extensive rearrangements, functional diversification, or concerted evolution of gene family members.
The human genome is unique in having a large proportion of recent segmental duplications that are interspersed. These promote further rearrangement through their own misalignment and subsequent nonallelic homologous recombination.

What effect do Transposable elements have on the Chromosomal landscape?
Eukaryotic genomes contain substantially differing amounts of repetitive DNA because of differential propagation and deletion of selfish genetic elements.
LINEs, SINEs, and long terminal repeat (LTR) elements propagate by reverse transcription of an RNA intermediate. DNA transposons move by a direct "cut and paste" mechanism.
Rates of retrotransposition and deletion vary a great deal, even between closely related species. In some species (cereal grains) retrotransposition is so extensive that it has led to a doubling in genome size. These species require counter-balancing mechanisms, such as illegitimate recombination and unequal homologous recombination, to prevent "genome obesity".
Repeats are not randomly distributed. Eg. L1 repeats prefer gene-poor AT-rich regions.
These biases probably reflect differences in selective constraint and recombination.

Chromosomal rearrangements and repeats: Cause or consequence?
Repetitive elements are clearly related to chromosomal rearrangements. However, in some cases the repeats may be the cause and in others the consequence. This has not yet been clearly determined.

Evolution of X & Y Chromosomes
Arose from an ordinary pair of autosomes ~300 million years ago
X-Y divergence began when X-Y cross-over ceased (↓ cross-over, ↑ evolution)
Inversions on the Y chromosome may have suppressed crossing over with the X chromosome. (Eg. the X-degenerate region of the MSY is the product of region-by-region evolutionary suppression of crossing over in the ancestral autosomes.)
X-Chromosome:
o High degree of conservation,
o therefore, low evolution, and
o high degree of synteny between distant relatives
Y-Chromosome:
o High degree of rapid evolution, and
o therefore unconstrained chromosomal evolution
o Low recombination within chromosome
o High degree of homology-mediated rearrangement
o High degree of gene conversion within duplicated sequences
o Low degree of synteny.

Mechanism of Evolution:
1. Mutation
At a time when sex was determined by temperature fluctuations, a mutation that left a promoter constitutively switched on could determine sex regardless of temperature.
Because such a mutation can be passed on for generations, it could lead to genetically determined sex.
2. Inversions
Inversions inhibit recombination
This allows mutations to accumulate
Non-functional genes accumulate
A series of deletions of these genes (of no consequence, because they are non-functional) leads to an overall reduction in the size of the Y chromosome.
This is referred to as Muller's ratchet (without recombination, mutations and deletions accumulate irreversibly).
Thus, many believe that the Y chromosome will eventually be wiped out and male sex may then be determined by a single X, as in Drosophila.
[Figure: Y CHROMOSOME, with a PAR (pseudoautosomal region) at each end and the non-recombining region (= MSY) between them; the MSY consists of a euchromatic region* and a heterochromatic region.]
*3 Classes in Euchromatic MSY:
1. X-transposed
2. X-degenerate
3. Ampliconic
MSY is well conserved from mice to humans
All other regions (areas of recombination) are not conserved

DNA Microarray Technology
Microarray analysis is an umbrella term that is often used to refer to cDNA microarrays, oligonucleotide arrays, and SAGE. Please refer to Block #1 for details.

Analysis of DNA Microarray Data, Statistical Significance

Normalization and Noise:
Normalization
Some kind of normalization is usually required when comparing more than one microarray experiment.
Adjust to account for differences in overall brightness of slides
Normalize relative to housekeeping genes
Noise
Refers to the variability and reproducibility of microarray experiments
Intra- and inter-microarray variation can significantly skew interpretation of data
Sample collection is very important. If comparing two conditions, you must control for all variables other than the one you are trying to measure
Technical noise can result from imperfections in the chip.
Both biological and technical replicates are required to measure and control these sources of noise

Supervised versus Unsupervised:
Supervised analysis: determines genes that fit a predetermined pattern
Usually used to find genes with expression levels that are significantly different between groups of samples, or to find genes that accurately predict a characteristic of the sample
Two popular supervised techniques are nearest-neighbour analysis and support vector machines.
Unsupervised analysis: characterizes the components of a data set without a priori input or knowledge of a training signal
Tries to find internal structure or relationships in the data without trying to predict some 'correct answer'.
Three classes:
1. Feature determination: look for genes with interesting patterns. Eg.
Principal-components analysis
2. Cluster determination: determine groups of genes with similar expression patterns. Eg. nearest-neighbour clustering, self-organizing maps, k-means clustering, 2D hierarchical clustering
3. Network determination: determine graphs representing gene-gene or gene-phenotype interactions. Eg. Boolean networks, Bayesian networks, relevance networks

Distance metrics versus clustering methods
Distance Metrics
A measure of dissimilarity: indicates how similar or different two expression profiles are
Pearson correlation
Absolute Pearson correlation
Uncentred Pearson
Absolute uncentred Pearson
Euclidean distance
Harmonically summed Euclidean distance
Spearman's rank
Kendall's τ
City-block distance
Clustering methods
Build on dissimilarity measures to create groups with similar features
1. Hierarchical clustering
2. Self-organizing maps
3. Relevance networks
4. Principal-components analysis
5. Nearest neighbours
6. Support vector machines

Figure 3 | Clustering and network-determination methods used in microarray analysis. The choice of the proper method and the results obtained clearly depend on the starting hypothesis. This figure shows the results of six analytical methods applied to the same hypothetical data set.
a | Hierarchical clustering sorts all genes (or samples), such that similar genes appear near each other. The length of the branch is inversely proportional to the degree of similarity. Shades of red indicate increased relative expression; shades of green indicate decreased relative expression.
b | Self-organizing maps find variable-sized clusters of genes that are similar to each other, given the input number of clusters to find.
c | Relevance networks find and display pairs of genes with strong positive and negative correlations, then construct networks from these gene pairs; typically, the strength of correlation is proportional to the thickness of the lines between genes, and red indicates a negative correlation.
d | Principal-components analysis is typically used as a visualization technique, showing the clustering or scatter of genes (or samples) when viewed along two or three principal components. In the figure, a principal component can be thought of as a 'meta-biological sample', which combines all the biological samples so as to capture the most variation in gene expression.
e | The nearest-neighbour supervised method first involves the construction of hypothetical genes that best fit the desired patterns (for example, a gene with high expression in disease 1 and low expression in disease 2, or vice versa). The technique then finds individual genes that are most similar to the hypothetical genes.
f | Instead of restricting to individual genes, support vector machines efficiently try several mathematical combinations of genes to find the line (or plane) that best separates groups of biological samples.
CD3G, CD3G antigen, γ-polypeptide; CD28, CD28 antigen; IL-24, interleukin-24; PHKB, phosphorylase kinase-β; PTGS2, prostaglandin-endoperoxide synthase 2; PXF, peroxisomal farnesylated protein; TCF12, transcription factor 12; STAT1, signal transducer and activator of transcription 1.

Post analysis Challenges
The rate-limiting step is the post-analytical work, where you try to determine what the results actually mean.
Names and information for 'genes' on the slide may be unknown or ambiguous
Probes that were initially thought to be unique might later be found to hit two or more genes.
Probes are sometimes designed against chromosomal regions, and it is unclear which gene they correspond to.
The analysis of a microarray experiment is never complete, because we are still learning about the genes on the array.

Assigned readings:
Ramaswamy et al. 2003. A molecular signature of metastasis in primary solid tumours.
Stuart et al. 2003. A gene-coexpression network for global discovery of conserved genetic modules.
Useful reference: Butte A. 2002. The use and analysis of microarray data.
Nat Rev Drug Discov. 1(12):951-60.

Validation and Limitations of DNA Microarrays
Limitations
DNA microarray studies are often not "hypothesis driven" and have been associated with "fishing". That said, microarray studies can often generate hypotheses for subsequent hypothesis-driven experiments
DNA microarray experiments require a large amount of high-quality sample, generally 50-200 µg
Will only pick up genes that are highly expressed. May miss genes that are upregulated but fall below the fold-change cut-off chosen by the researcher, or that are present at such low copy number that the array does not detect them at all.
Reproducibility: need to standardize the technology to enable comparison of data between labs
o Introduction of artifacts is possible at any time during an array experiment
No consensus about how to interpret gene expression patterns of hypothetical genes, genes of unknown function, or transcripts identified only by ESTs
Hard to handle the large amounts of data produced.

Validation
Experimental Quality Control:
o An up-front validation is achieved by including separate regions of each gene as individual targets on the microarray
o Optimization of each experiment is required
o To eliminate background noise and false positives, perform repeats of experiments.
   Multiple arrays
   Replicates of each RNA sample
   Replicates of RNA preparation
o During image acquisition
   Normalization of the expression of genes of interest to a housekeeping gene
   Background subtraction
   Data processing
   Standardization of data (ie. only keep genes that were expressed at least 2-fold higher than in control)
   Use of visualization tools
o There is no standard for numerical analysis
   Mostly achieved using high-end computational methods
   Clustering
   Pattern identification
2 main validation techniques:
o In Silico
   Compare your array results with the information available in the literature, or in public and private databases.
   Cross-reference microarray results for agreement with known expression information in the literature.
   This validates the general performance of your microarray system and provides confidence in the overall data.
   This will become more useful when standardized methods of reporting results are uniformly applied.
o Laboratory Based
   Independent experimental verification of gene expression levels.
   Typically uses the same samples as were used on the array
   The methodology depends on the experimental question, but often includes semi-quantitative RT-PCR (reverse transcription PCR), real-time RT-PCR, Northern blot, ribonuclease protection assay, in situ hybridization, or immunohistochemistry with tissue microarrays.
Also need to validate universality of results
o Determine if the experimental profiles are a universal feature of the biological phenomenon under study.
o Are the data an essential descriptor of biological state? Address this via evaluation of a critical gene set in a larger and more extensive study group (i.e. in silico or in the lab)
Recommended Read: Chuaqui RF et al. Post-analysis follow-up and validation of microarray experiments. (2002) Nature Genetics Supplement (32) 509-514

Statistical Significance versus Biological Significance
Biological significance
"Biological significance" refers to a statistically significant effect that has a noteworthy impact on health or survival.
If an observed effect is small but quite precise (i.e., there is little uncertainty in the observed value), the effect can be statistically significant even if it isn't biologically significant. For example, a factor that causes a decrease in blood pressure of 1 mmHg on average can be statistically significant if tested in a large group of people, but an average reduction of 1 mmHg in blood pressure has no practical clinical implication per se.
Statistical significance
"Statistical significance" means statistical analysis has revealed an effect unlikely to have occurred by chance alone. The level of significance refers to the degree to which the result could be explained by chance.
At the .05 (5%) level the result could have occurred by chance 1 time in 20; at the .01 (1%) level the result could have occurred by chance only 1 time in 100.
Any effect observed in a study or experiment carries with it some degree of uncertainty, or imprecision, because of randomness and variability in most biological phenomena. Statistical techniques evaluate an observed effect in view of its precision to determine with what probability it might have arisen by chance (the level of significance). Values with a low probability of occurring by chance are called "statistically significant" and are thought to represent a real effect.
So statistical significance tells you if there is a difference, while biological significance tells you whether that difference is important.

Gene Networks
A method to examine cellular processes and functionally characterize genes by identifying groups of genes that appear to be co-expressed
Based on the concept that genes that are co-expressed are associated and contribute together functionally in pathways.
Rationale: if a gene is linked in a network to many genes that participate in the same biological process, it is reasonable to hypothesize that it also participates in that process.
Problems:
o May miss genes known to be involved in pathways
o No consensus exists as to how to interpret the gene expression patterns of:
   Hypothetical genes
   Genes of unknown function
   Transcripts identified only by ESTs
o Inefficiently samples the biological variability of a system
o Statistical significance ≠ biological significance

Identification of Pathways
Rationale: under the gene networks approach, proteins located in the same pathway should be expressed at the same time.
However, microarrays pick up so much expression data that a result may be statistically significant without necessarily being biologically significant.
Therefore, a better approach would be to look at expression across species.
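Worked sketch (an illustration, not part of the assigned readings): the 1 mmHg blood pressure example above can be made concrete with a simple z-test. The between-person standard deviation of 10 mmHg, the sample sizes, and the function names are assumptions chosen for illustration. The second function shows why the "1 time in 20" level matters on an array: if no gene truly differed and each gene were tested separately, about alpha × (number of genes) would still pass the cut-off by chance.

```python
import math


def two_sided_p_value(mean_diff, sd, n):
    """Two-sided p-value for a z-test of an observed mean difference vs. 0.

    Assumes a known SD and an approximately normal sample mean.
    """
    standard_error = sd / math.sqrt(n)
    z = abs(mean_diff) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def expected_false_positives(alpha, n_tests):
    """Expected number of chance 'hits' if no test has a real effect."""
    return alpha * n_tests


EFFECT_MMHG = 1.0  # mean drop in blood pressure, from the notes' example
SD_MMHG = 10.0     # assumed between-person SD (hypothetical value)

for n in (25, 100, 10_000):
    p = two_sided_p_value(EFFECT_MMHG, SD_MMHG, n)
    print(f"n={n:>6}  p={p:.3g}")

# Hypothetical array with 10,000 genes tested at the 0.05 level.
print(expected_false_positives(0.05, 10_000))
```

The effect size is the same 1 mmHg in every row; only the sample size changes, which is why the p-value alone says nothing about whether the effect matters biologically.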
Comparative Genomics to Functional Phylogenomics
Due to the above limitations, a better method would be to look at gene expression across species to find genes that are co-expressed. Genes that are co-expressed across species are more likely to be functionally significant.
Phylogenomics is therefore a way of increasing biological significance by studying co-expression across species (instead of repeating the measurement many times within one species, which only increases statistical significance).
Phylogenomics: The study of the evolution of genes and gene families using DNA sequence information from organisms selected at major branch points along the phylogenetic continuum.
Phylogenetic: Of or pertaining to the history of ancestry and descent.
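Worked sketch (a simplified illustration, not the method of Stuart et al.): one way to picture "co-expressed across species" is to compute a Pearson correlation for every gene pair within each species, keep the pairs above a cut-off, and then retain only the pairs that pass in every species. The gene names, expression values, and the 0.9 threshold are all hypothetical, and genes are assumed to be already matched to their orthologues under a shared name.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold):
    """Set of gene pairs whose correlation is at least `threshold`."""
    genes = sorted(profiles)
    pairs = set()
    for i, gene_1 in enumerate(genes):
        for gene_2 in genes[i + 1:]:
            if pearson(profiles[gene_1], profiles[gene_2]) >= threshold:
                pairs.add((gene_1, gene_2))
    return pairs


def conserved_pairs(species_profiles, threshold=0.9):
    """Gene pairs that are co-expressed in every species."""
    pair_sets = [
        coexpressed_pairs(profiles, threshold)
        for profiles in species_profiles.values()
    ]
    return set.intersection(*pair_sets)


# Hypothetical expression values across four conditions per species.
SPECIES_PROFILES = {
    "species_1": {
        "geneA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8],
        "geneC": [4, 3, 2, 1],
    },
    "species_2": {
        "geneA": [1, 3, 2, 5],
        "geneB": [2, 6, 4, 10],
        "geneC": [1, 3, 2, 5],
    },
}

print(conserved_pairs(SPECIES_PROFILES))
```

In this toy data geneC tracks geneA only in species_2, so that link is dropped, while the geneA-geneB link is seen in both species and is kept as the more likely functional association.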