CHAPTER 22 Application and Experimental Questions E1. With regard to DNA microarrays, answer the following questions: A. What is attached to the slide? Be specific about the number of spots, the lengths of DNA fragments, and the origin of the DNA fragments. B. What is hybridized to the microarray? C. How is hybridization detected? Answer: A. A DNA microarray is a small slide that is dotted with many different fragments of DNA. In some microarrays, DNA fragments, which were made synthetically (e.g., by PCR), are individually spotted onto the slide. The DNA fragments are typically 500 to 5,000 bp in length, and a few thousand to tens of thousands are spotted to make a single array. Alternatively, short oligonucleotides can be directly synthesized on the surface of the slide. In this process, the DNA sequence at a given spot is produced by selectively controlling the growth of the oligonucleotide using narrow beams of light. In this case, there can be hundreds of thousands of different spots on a single array. B. In most cases, fluorescently labeled cDNA is hybridized to the microarray, though labeled genomic DNA or RNA could also be used. C. After hybridization, the array is washed and placed in a scanning confocal fluorescence microscope that scans each pixel (the smallest element in a visual image). After correction for local background, the final fluorescence intensity for each spot is obtained by averaging across the pixels in each spot. This results in a group of fluorescent spots at defined locations in the microarray. E2. For two-dimensional gel electrophoresis, what physical properties of proteins promote their separation in the first dimension and the second dimension? Answer: In the first dimension (i.e., in the tube gel), proteins migrate to the point in the gel where their net charge is zero. In the second dimension (i.e., the slab gel), proteins are coated with SDS and separated according to their molecular mass. E3. Can two-dimensional gel electrophoresis be used as a purification technique? Explain. Answer: Yes, two-dimensional gel electrophoresis can be used as a purification technique. A spot on a two-dimensional gel can be cut out, and the protein can be eluted from the spot. This purified protein can be subjected to tandem mass spectroscopy to determine peptide sequences within the protein. It should be mentioned, however, that two-dimensional gel electrophoresis would not be used to purify proteins in a functional state. The exposure to SDS in the second dimension would denature proteins and probably inactivate their function. E4. Explain how tandem mass spectroscopy can be used to determine the sequence of a peptide. Once a peptide sequence is known, how is this information used to determine the sequence of the entire protein? Answer: In tandem mass spectroscopy, the first spectrometer determines the mass of a peptide fragment from a protein of interest. The second spectrometer determines the masses of progressively smaller fragments that are derived from that peptide. Because the masses of each amino acid are known, the molecular masses of these smaller fragments reveal the amino acid sequence of the peptide. With peptide sequence information, it is possible to use the genetic code and produce DNA sequences that could encode such a peptide. More than one sequence is possible due to the degeneracy of the genetic code. These sequences are used as query sequences to search a genomic database. This program will (hopefully) locate a match. The genomic sequence can then be analyzed to determine the entire coding sequence for the protein of interest. E5. Describe the two general types of protein microarrays. What are their possible applications? Answer: The two general types of protein microarrays are antibody microarrays and functional protein arrays. In an antibody microarray, many different antibody molecules, each one recognizing a different peptide sequence, are spotted onto the array. Cellular proteins are isolated, fluorescently labeled and exposed to the microarray. When a given protein is recognized by an antibody, it will be captured by the antibody and remain bound to the spot. Because each antibody recognizes a different peptide sequence, this microarray can be used to monitor protein expression levels. A functional protein microarray involves purifying cellular proteins and spotting them onto a slide. This type of microarray can be analyzed with regard to substrate specificity, drug binding, and/or protein-protein interactions. E6. What is a motif? Why is it useful for computer programs to identify functional motifs within amino acid sequences? Answer: A motif is a sequence that carries out a particular function. There are promoter motifs, enhancer motifs, and amino acid motifs that play functional roles in proteins. For a long genetic sequence, a computer can scan the sequence and identify motifs with great speed and accuracy. The identification of amino acid motifs helps a researcher to understand the function of a particular protein. E7. Discuss why it is useful to search a database to identify sequences that are homologous to a newly determined sequence. Answer: By searching a database, one can identify genetic sequences that are homologous to a newly determined sequence. In most cases, homologous sequences carry out identical or very similar functions. Therefore, if one identifies a homologous member of a database whose function is already understood, this provides an important clue regarding the function of the newly determined sequence. E8. The secondary structure of 16S rRNA has been predicted using a computer-based sequence analysis. In general terms, discuss what type of information is used in a comparative sequence analysis, and explain what assumptions are made concerning the structure of homologous RNAs. Answer: In a comparative approach, one uses the sequences of many homologous genes. This method assumes that RNAs of similar function and sequence have a similar structure. For example, computer programs can compare many different 16S rRNA sequences to aid in the prediction of secondary structure. E9. Discuss the basis for secondary structure prediction in proteins. How reliable is it? Answer: The basis for secondary structure prediction is that certain amino acids tend to be found more frequently in helices or β sheets. This information is derived from the statistical frequency of amino acids within secondary structures that have already been crystallized. Predictive methods are perhaps 60 to 70% accurate, which is not very good. E10. To reliably predict the tertiary structure of a protein based on its amino acid sequence, what type of information must be available? Answer: The three-dimensional structure of a homologous protein must already be solved by X-ray crystallography before one can attempt to predict the three-dimensional structure of a protein of interest based on its amino acid sequence. E11. In this chapter, we considered a computer program that can translate a DNA sequence into a polypeptide sequence. A researcher has a sequence file that contains the amino acid sequence of a polypeptide and runs a program that is opposite to the TRANSLATION program. This other program is called BACKTRANSLATE. It can take an amino acid sequence file and determine the sequence of DNA that would encode such a polypeptide. How does this program work? In other words, what are the genetic principles that underlie this program? What type of sequence file would this program generate: a nucleotide sequence or an amino acid sequence? Would the BACKTRANSLATE program produce only a single sequence file? Explain why or why not. Answer: The BACKTRANSLATE program works by knowing the genetic code. Each amino acid has one or more codons (i.e., three-base sequences) that are specified by the genetic code. This program would produce a sequence file that was a nucleotide base sequence. The BACKTRANSLATE program would produce a degenerate base sequence because the genetic code is degenerate. For example, lysine can be specified by AAA or AAG. The program would probably store a single file that had degeneracy at particular positions. For example, if the amino acid sequence was lysine–methionine–glycine– glutamine, the program would produce the following sequence: 5–AA(A/G)ATGGG(T/C/A/G)CA(A/G) The bases found in parentheses are the possible bases due to the degeneracy of the genetic code. E12. In this chapter, we considered a computer program that can translate a DNA sequence into a polypeptide sequence. Instead of running this program, a researcher could simply look the codons up in a genetic code table and determine the sequence by hand. What are the advantages of running this program compared to the old-fashioned way of doing it by hand? Answer: The advantages of running a computer program are speed and accuracy. Once the program has been made, and a sequence file has been entered into a computer, the program can analyze long genetic sequences quickly and accurately. E13. To identify the following types of genetic occurrences, would a program use sequence recognition, pattern recognition, or both? A. Whether a segment of Drosophila DNA contains a P element (which is a specific type of transposable element) B. Whether a segment of DNA contains a stop codon C. In a comparison of two DNA segments, whether there is an inversion in one segment compared to the other segment D. Whether a long segment of bacterial DNA contains one or more genes Answer: A. To identify a specific transposable element, a program would use sequence recognition. The sequence of P elements is already known. The program would be supplied with this information and scan a sequence file looking for a match. B. To identify a stop codon, a program would use sequence recognition. There are three stop codons that are specific three-base sequences. The program would be supplied with these three sequences and scan a sequence file to identify a perfect match. C. To identify an inversion of any kind, a program would use pattern recognition. In this case, the program would be looking for a pattern in which the same sequence was running in opposite directions in a comparison of the two sequence files. D. A search by signal approach uses both sequence recognition and pattern recognition as a means to identify genes. It looks for an organization of sequence elements that would form a functional gene. A search by content approach identifies genes based on patterns, not on specific sequence elements. This approach looks for a pattern in which the nucleotide content is different from a random distribution. The third approach to identify a gene is to scan a genetic sequence for long open reading frames. This approach is a combination of sequence recognition and pattern recognition. The program is looking for specific sequence elements (i.e., stop codons) but it is also looking for a pattern in which the stop codons are far apart. E14. Here is short nucleotide sequence within a gene. Via the internet (e.g., see www.ncbi.nlm.nih.gov/Tools), determine what gene this sequence in found within. Also, determine the species in which this gene sequence is found. 5’-GGGCGCAATTACTTAACGCCTCGATTATCTTCTTGCGCCACTGATCATTA-3’ Answer: This sequence is within the lacY gene of the lac operon of E. coli. It is found on page 588, nucleotides 801-850. E15. Membrane proteins often have transmembrane regions that span the membrane in an α-helical conformation. These transmembrane segments are about 20 amino acids long and usually contain amino acids with nonpolar (i.e., hydrophobic) amino acid side chains. Researchers can predict whether a polypeptide sequence has transmembrane segments based on the occurrence of segments that contain 20 nonpolar amino acids. To do so, each amino acid is assigned a hydropathy value, based on the chemistry of its amino acid side chain. Amino acids with very nonpolar side chains are given a high value, whereas amino acids that are charged and/or polar are given low values. The hydropathy values usually range from about +4 to –4. Computer programs have been devised that scan the amino acid sequence of a polypeptide and calculate values based on the hydropathy values of the amino acid side chains. The program usually scans a window of seven amino acids and assigns an average hydropathy value. For example, the program would scan amino acids 1–7 and give an average value, then it would scan 2–8 and give a value, then it would scan 3–9 and give a value, and so on, until it reached the end of the polypeptide sequence (i.e., until it reached the carboxyl terminus). The program then produces a figure, known as a hydropathy plot, which describes the average hydropathy values throughout the entire polypeptide sequence. An example of a hydropathy plot is shown here. [Insert Text Art 22.4] A. How many transmembrane segments are likely in this polypeptide? B. Draw the structure of this polypeptide if it were embedded in the plasma membrane. Assume that the amino terminus is found in the cytoplasm of the cell. Answer: A. This sequence has two regions that are about 20 amino acids long and very hydrophobic. Therefore, it is probable that this polypeptide has two transmembrane segments. B. E16. Explain how a computer program can predict RNA secondary structure. What is the underlying genetic concept that is used by the program to predict secondary structure? Answer: RNA secondary structure is based on the ability of complementary sequences (i.e., sequences that obey the AU/GC rule) to form a double helix. The program employs a pattern recognition approach. It looks for complementary sequences based on the AU/GC rule. E17. Are the following statements about protein structure prediction true or false? A. The prediction of secondary structure relies on information regarding the known occurrence of amino acid residues in α helices or β sheets from X-ray crystallographic data. B. The prediction of secondary structure is highly accurate, nearly 100% correct. C. To predict the tertiary structure of a protein based on its amino acid sequence, it is necessary that the protein of interest is homologous to another protein whose tertiary structure is already known. Answer: A. B. C. True False. The programs are only about 60 to 70% accurate. True.