Chapter 20: Genomics and Proteomics

In 1920, geneticists turned from the study of individual genes to focus on the entire genome of an organism. The goal became to map all of the genes in an organism, and geneticists developed a two-part approach:

1. Identify spontaneous mutations, or induce mutants using chemical or physical agents
2. Generate genetic maps by linkage analysis using the mutant strains

This method is the backbone of genetic analysis and is still used today. However, mutational analysis has limits:

1. At least one mutation per gene must be available for mutational analysis to be used
2. Obtaining mutations is time consuming
3. Some mutations are lethal and some have no clear phenotype, making it difficult to map the mutated gene

Beginning in the 1980s, recombinant DNA technology was introduced as a way to map the human genome. A new method, positional cloning, was developed and used to isolate and map genes one at a time. By the mid 1980s, around 3500 genes had been identified and mapped.

In 1977, Fred Sanger and colleagues began the study of genomes (genomics) using a method developed to sequence the genome of a virus. Genomics includes several subfields: structural genomics, functional genomics, and comparative genomics. Proteomics, the study of proteins, came as an outgrowth of genomics.

The Human Genome Project developed as an international effort to determine the sequence of the 3.2 billion base pairs making up the human genome and to identify all of the genes in it.

20.1 Genomics: sequencing is the basis for identifying and mapping all genes in a genome

Clone-by-clone method: libraries of cloned large fragments are constructed that together cover all of the DNA in an organism’s genome. The clones are assembled into genetic and physical maps encompassing the entire genome. The nucleotide sequence is then determined clone by clone until the entire length is sequenced. This method depends on restriction maps and on sequencing large numbers of clones.
This method is time consuming.

Shotgun method: two or three preparations of genomic DNA are made. One is cut into short fragments, another into longer fragments, and the third into much larger fragments. A library is made from each preparation, and clones are selected at random and sequenced. Software assembles long stretches of sequence from the overlapping fragments in the libraries, using the sequences from the larger clones as a framework. This method was developed by Craig Venter and colleagues in 1995.

20.2 An overview of genomic analysis

Compiling the Sequence – the genome is sequenced more than once to check that the result is accurate. The HGP sequenced the human genome 12 times using the shotgun process. The privately run project examined a portion of the genome more than 35 times. A draft was finished in 2001, with some parts incomplete. A final version was published in 2003.

Annotating the Sequence – annotation is a process that identifies genes, their regulatory sequences, and their functions. It also identifies non-protein-coding genes (such as those for rRNA, tRNA, and small nuclear RNA). Protein-coding genes are located by analyzing the sequence with software. Protein-coding genes contain open reading frames (ORFs), runs of nucleotide triplets that can be translated into amino acids. Because a sequence is read three bases at a time, the correct reading frame is not known in advance, so the search begins at the first nucleotide and is repeated in the other reading frames. Searching for ORFs that start with an ATG and end at a termination codon is one strategy for finding genes.

20.3 Functional genomics classifies genes and identifies their functions

After annotation, the next step is assigning functions to the genes. Some genes have already had functions assigned by classical methods, but many have no function assigned. One approach uses homology searches, which compare the new gene with similar genes already isolated from other organisms.
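The ORF search described in section 20.2 can be sketched in a few lines of Python. This is only an illustration, not the software used by any genome project: the function name `find_orfs`, the `min_codons` parameter, and the example sequence are all hypothetical, and the sketch scans only the three reading frames of the given strand (a fuller search would also scan the reverse complement).

```python
# Minimal ORF scan: look for ATG ... stop codon in each of the
# three forward reading frames. Names and defaults are illustrative.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return a list of (start, end, frame) tuples for the forward strand.

    start is the index of the A in ATG, end is one past the last base
    of the stop codon, and frame is 0, 1, or 2. min_codons is the
    minimum number of codons before the stop, counting the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        # Step through the sequence three bases at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG in this frame
    return orfs


# Hypothetical example sequence with one ORF in frame 2.
example = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])  # prints: 2 2 14 ATGAAATTTTAG
```

An ATG with no in-frame stop codon after it is not reported, which matches the strategy of looking for a start codon followed by a termination codon. Real annotation software uses additional signals beyond this simple pattern.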
20.7 The Human Genome Project (HGP)

The goal of the HGP was to determine the sequence of the human genome by using recombinant DNA technology and DNA sequencing instead of mutational analysis. The HGP has produced much information, and much of it still requires interpretation. What is known, though, is that humans and other organisms share a common set of genes essential for cellular function and reproduction.

In 1990, the Human Genome Project began under the direction of James Watson. It was designed to sequence all of the DNA in the human genome, to identify and map the thousands of genes on the chromosomes, and to establish the function of all genes. The HGP also set up the ELSI program (Ethical, Legal, and Social Implications) to ensure that genetic information would be used in the proper way.

Major features of the human genome – In February 2001, about 96% of the euchromatic region (areas that contain most of the structural genes) of the DNA had been analyzed. The remaining work was finished by 2003, and attention is now directed at analyzing the data.

The unfinished tasks in human genome sequences – Two types of gaps remain in the sequence. 324 gaps remain in the euchromatic portion, most of which contain duplicated regions that are difficult to assemble. Other major gaps are in the heterochromatic regions (areas thought to lack structural genes).

Major Features of the Human Genome

1. It contains more than 3 billion nucleotides, but protein-coding sequences make up only about 5% of the genome
2. It contains between 25000 and 30000 genes
3. More than 40% of the genes identified have no known molecular function
4. Genes are not uniformly distributed on the chromosomes. Gene-rich areas are separated by gene-poor areas that account for 20% of the genome. Chromosome 19 has the highest gene density, while chromosome 13 and the Y chromosome have the lowest density.
5. Human genes are larger and contain more introns than genes in invertebrates like Drosophila.
20.8 Comparative genomics is a versatile tool

Comparative genomics uses a variety of techniques and resources, including the construction and use of databases containing nucleic acid and amino acid sequences, gene mapping, and mutagenesis.

20.10 Proteomics identifies and analyzes the proteins in a cell

Proteomics studies the proteome, the complete set of proteins encoded by a genome. The term can also describe the set of proteins expressed in a cell at a given time. In most genomes sequenced to date, many newly discovered genes have no known function. In the human genome, about 41% of genes are of unknown function, and it is estimated that 40-60% of human genes produce more than one protein.