This article reprinted from: Robinson, D. L., J. M. Lau, S. Porter, B. S. Wiseman, and M. Woodrow. 2009. Modular cloning and sequencing: A six-week project in molecular biology and bioinformatics. Pages 111-183, in Tested Studies for Laboratory Teaching, Volume 30 (K.L. Clase, Editor). Proceedings of the 30th Workshop/Conference of the Association for Biology Laboratory Education (ABLE), 403 pages. Compilation copyright © 2009 by the Association for Biology Laboratory Education (ABLE) ISBN 1-890444-12-X All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the copyright owner. Use solely at one’s own institution with no intent for profit is excluded from the preceding copyright restriction, unless otherwise noted on the copyright notice of the individual chapter in this volume. Proper credit to this publication must be included in your laboratory outline for each use; a sample citation is given above. Upon obtaining permission or with the “sole use at one’s own institution” exclusion, ABLE strongly encourages individuals to use the exercises in this proceedings volume in their teaching program. Although the laboratory exercises in this proceedings volume have been tested and due consideration has been given to safety, individuals performing these exercises must assume all responsibilities for risk. The Association for Biology Laboratory Education (ABLE) disclaims any liability with regards to safety in connection with the use of the exercises in this volume. The focus of ABLE is to improve the undergraduate biology laboratory experience by promoting the development and dissemination of interesting, innovative, and reliable laboratory exercises. 
Visit ABLE on the Web at: http://www.ableweb.org

Modular Cloning and Sequencing: A Six-Week Project in Molecular Biology and Bioinformatics

Dave L. Robinson1, Joann M. Lau1, Sandra Porter2, Bryony S. Wiseman3, and Melissa Woodrow3

1 Department of Biology, Bellarmine University, 2001 Newburg Road, Louisville, KY 40205
drobinson@bellarmine.edu
jlau@bellarmine.edu

2 Geospiza Inc., 100 West Harrison, North Tower, Suite #330, Seattle, WA 98119
sandy@geospiza.com

3 Biotechnology Explorer Program, Bio-Rad Laboratories, 2000 Alfred Nobel Drive, Hercules, CA 94547
Bryony_Ruegg@bio-rad.com

Association for Biology Laboratory Education (ABLE) 2008 Proceedings, Vol. 30:111-183

Biography

Dave L. Robinson received his B.S. and M.S. in plant science from the University of Arizona, and his Ph.D. in plant physiology from the University of Minnesota. Now an Associate Professor of Biology at Bellarmine University in Louisville, KY, he has taught Principles of Biology, Plant Diversity, Molecular Biology, Environmental Science, and Genetics, as well as seminar courses in ethnobotany. He has served as Biology Department Chair as well as principal investigator on a 3-year grant from NIH-NCRR. His research interests are in the physiology and genetics of weedy plants like ragweed, dandelion, and white snakeroot.

Joann M. Lau received a Ph.D. from the University of Illinois Urbana-Champaign and a B.A. from Bellarmine University. While at UIUC, she was a USDA Agriculture Genome Sciences and Public Policy Fellow and also received the Colgate-Palmolive Graduate Fellowship and the Eugene S. Boerner Graduate Fellowship. She has taught Molecular Biology, Drugs and the Human Body, Introduction to Forensics Science, Modern Genetics, Introduction of Life Sciences, Principles of Biology labs, Cell Biology labs, and Molecular Biology labs at Bellarmine University.
Her research currently involves developing new molecular biology exercises for the classroom, studying the evolution of triple repeat diseases in non-human primates, the effects of Reishi mushroom on lung cancer cell proliferation, and the expression of allergenic-related genes in ragweed.

Dr. Sandra Porter, the education director at Geospiza, Inc., has been a long-time participant in biotechnology and bioinformatics education. For several years, Dr. Porter ran the biotechnology education program at Seattle Central Community College, and acted as the regional director for BioLink, an Advanced Technology Education center funded by the National Science Foundation. In 2001, Dr. Porter joined Geospiza, Inc., to work on an NSF-funded project to develop educational materials that use bioinformatics resources in the same way that they are used by biologists, as tools for understanding biology and doing experiments. During her years at Geospiza, Dr. Porter has written and published two laboratory manuals, one CD, and a peer-reviewed study examining the efficacy of using molecular viewing tools for teaching DNA structure, in addition to research papers on genetic variation in the genes for clotting factors. She has recently completed an NIH-funded project on applying the use of molecular viewing programs in student activities that explore alcohol metabolism and human polymorphisms. Currently, she is writing a textbook on bioinformatics for biology students. She also writes a blog called “Discovering Biology in a Digital World” (www.scienceblogs.com/digitalbio).

Bryony Wiseman graduated in Molecular Biology from Glasgow University, Scotland. She won a four-year Imperial Cancer Research Fund studentship and received a Ph.D. in Biochemistry in 1999 from University College, London for her work on Ras and Raf signaling. She was awarded a U.S. Dept.
of Defense Breast Cancer Research Program post-doctoral fellowship to work on the extracellular environment and mammary gland development at UCSF, San Francisco. She changed focus in 2002 when she joined Bio-Rad Laboratories’ Biotechnology Explorer Program. She is currently a Staff Scientist at Bio-Rad and develops tools and kits to help teachers teach biotechnology to high school and undergraduate students.

Contents

Introduction to cloning GAPDH gene (for instructor)
Bioinformatic Analysis of GAPDH sequences (for student)
Background Reading
Bioinformatic Exercises
Notes for Instructor
Materials
Acknowledgement
Literature Cited
Appendix

Introduction

This project involves isolating (cloning) and analyzing a major portion of the gene for the enzyme Glyceraldehyde 3-phosphate dehydrogenase (GAPDH) from an uncharacterized plant species. This gene is considered a housekeeping gene (a continually-transcribed gene whose product is involved in basic cell function). It codes for an enzyme that catalyzes an important step of glycolysis, a stage of respiration that occurs in all living cells. Because of the central importance of GAPDH, this gene occurs in all plants, as well as in all other organisms: animals, protists, fungi, and bacteria. There are therefore many opportunities to draw connections between the molecular aspects of GAPDH and its biomedical and evolutionary significance (Figge et al., 1999). For instance, the human GAPDH gene has been found to be highly expressed in 21 different types of cancer and may play an important role in future cancer therapies (Altenberg and Greulich, 2005).
Others have reported that GAPDH is a multifaceted protein that may be involved in regulating transcription and programmed cell death, and may have a role in age-related diseases like Alzheimer’s and Huntington’s Disease (Kim and Dang, 2005; Sirover, 1999). GAPDH is also thought to have roles in DNA replication and repair, cytoskeletal organization, and phosphotransferase activity (Tatton et al., 2000).

Aerobic respiration is composed of four basic stages: glycolysis, formation of acetyl coenzyme A, the citric acid cycle, and the combined processes of electron transport and chemiosmosis. It is the first stage, glycolysis, that is the most uniform and unwavering in its occurrence in the biological world. Cells undergoing anaerobic respiration due to lack of oxygen, like fermenting yeast or vigorously exercised muscles, carry out only glycolysis. Prokaryotic organisms (bacteria), which do not contain mitochondria (where the latter three stages occur in eukaryotes), still carry out glycolysis.

Glyceraldehyde 3-Phosphate (GAP) is invaluable to the second half of glycolysis (Dennis and Blakely, 2000). One of the unique features of GAP is its ability to have a phosphate added to it without having to consume an ATP. The product of the GAPDH reaction, 1,3-Bisphosphoglycerate (BPG), is such a high-energy molecule that it is eventually acted upon by the other enzymes of glycolysis to yield two ATP molecules. After the loss of its two phosphates (for making ATP from ADP) and other structural alterations, that three-carbon molecule becomes pyruvate, which can be actively transported into the mitochondria for the other stages of aerobic respiration.

GAPDH Reaction:
Glyceraldehyde 3-Phosphate + NAD+ + Pi → 1,3-Bisphosphoglycerate + NADH + H+

Another important feature of the GAPDH reaction is the generation of reducing power in the form of NADH.
This coenzyme provides reducing power to hundreds of different enzymes involved in catalyzing oxidation-reduction reactions in the cell.

The DNA sequence for the GAPDH gene that results from this exercise is important for several reasons. The basic structure of the GAPDH gene can be examined by looking at the DNA sequence. For instance, the location of specific introns and exons can be predicted using readily-available software. The amino acid sequence of the GAPDH gene product can also be predicted. Evolutionary differences between organisms can be studied by comparing plant GAPDH genes isolated by researchers working with other species (Olsen et al., 1975). The biochemical characteristics (e.g. the active site) can also be deduced by aligning the sequences from numerous species and looking for areas showing high levels of consensus.

When the DNA sequence for a housekeeping gene like GAPDH is elucidated in a relatively unstudied species, it provides novel information about that plant that can be useful to other biologists. Whereas past researchers have focused more on studying the genomes of model species (like fruit flies, yeast, mice, and Arabidopsis), information about lesser-known species can be quite valuable in filling in the evolutionary gaps. Our goal is to see these unique sequences published in GenBank (Benson et al., 2007) with teachers and students as co-authors (see Accession number DQ075672).

The steps* to be taken in this project are as follows:

Week 1:
1. Identify the plant species to be studied
2. Extract genomic DNA
3. Initial amplification of the GAPDH gene using the polymerase chain reaction (PCR)

Week 2:
4. Exonuclease treatment
5. Nested PCR

Week 3:
6. Clean PCR product
7. Blunt PCR product
8. Ligate (insert) the GAPDH gene fragment into a plasmid vector
9. Prepare competent cells
10. Transform the plasmid into bacteria via heat-shock transformation

Week 4:
11. Select and multiply the bacteria containing the recombinant plasmid**
12. Isolate the plasmid from the bacteria
13. Confirm the presence of the plant gene in the plasmid by restriction enzyme digest
14. Prepare plasmid for DNA sequencing and obtain the sequence of the GAPDH gene

Weeks 5 and 6:
15. Perform bioinformatic analysis of the plant GAPDH gene

* Assumes 3-hour lab periods
** Students select colonies at least one day before lab

Workflow of complete activity: (figure)

Stage 1: DNA Extraction

This project is an opportunity to perform novel research - to clone and sequence a gene that has not yet been characterized - and to add that information to the body of scientific knowledge about GAPDH. The first step in this exercise is to choose an interesting plant species to work with. Some model species (for example Arabidopsis thaliana or Chlamydomonas) or crop plants (like rice and wheat) have already had their genomes sequenced, but you may choose to reproduce and confirm these data. Alternatively, you might choose a species that is less studied. There are over 250,000 plant species known to exist on the planet, providing plenty of options to work with. You could also choose a variety or cultivar (within a species) that no one has examined yet.

Background

In order to clone a gene from an organism, DNA must first be isolated from that organism. This genomic DNA is isolated from one or two plants using column chromatography. Plant material is weighed, and the material is ground in lysis buffer with high salt and protein inhibitors using a micropestle. The solid plant material is removed by centrifugation, ethanol is added to the lysate and the lysate is applied to the column. The ethanol and salt encourage DNA to bind to the silica in the chromatography column.
The column is then washed three times and the DNA is eluted using sterile water at 70°C.

For PCR to be successful the extracted DNA needs to be relatively intact. The best sources for DNA extraction are green leaves, but fruit, roots, or germinating seeds should also suffice. It is better to use tissue that is relatively young and still growing, as the nucleus:cytoplasm ratio will be more favorable, the cell walls will be thinner, and the amount of potentially harmful secondary products will be lower.

Two features of plants make DNA extraction different from that in animals. First, plants have a tough cell wall made of cellulose that has to be penetrated. This is easy to do with vigorous grinding using a mortar and pestle. Second, a major part of every plant cell is a vacuole that contains acids, destructive enzymes (including nucleases) and unique secondary compounds (chemicals produced from pathways that are not part of primary metabolism) that might damage DNA. Since plants are immobile and cannot easily escape from herbivores and pathogens, they produce a myriad of different secondary compounds to defend themselves. Animals and bacteria don’t typically produce as many interfering secondary compounds as plants do. Although some of these phytochemicals can be toxic, others have proven very useful. We wouldn’t have the herbs, spices, natural flavorings and numerous medicines we have if it weren’t for these secondary metabolites.

Vacuoles are the ‘garbage-dump’ of the cell, as plants cannot excrete wastes the way animals do. Since 90% of the cell volume consists of vacuole, that is a lot of waste.
With a pH of 5.0-5.5 and many harmful chemicals (as well as nucleases) inside it, the vacuole can be problematic during DNA extractions, because it is impossible to break open plant cells and nuclei without also breaking the vacuole. To minimize contaminants from the vacuolar contents, salts and other inhibitors need to be added to the lysis buffer.

Step 1- DNA Extraction Protocol
Procedure can be obtained at explorer.bio-rad.com

Stage 2: GAPDH PCR

The overall purpose of this experiment is to clone a portion of the Glyceraldehyde 3-phosphate dehydrogenase (GAPDH) gene. Because it is a vital metabolic enzyme involved in one of the most important of biological processes (glycolysis), the GAPDH protein is highly conserved between organisms, especially in vital domains of the enzyme such as the active site. However, this does not mean that the gene sequences are identical in different organisms. Much of a gene’s DNA sequence does not code for protein. This intronic DNA is not subject to the same selective pressures as DNA that codes for protein. In addition, within exons (gene sections that encode proteins) there is degeneracy of the genetic code, such that different DNA triplet codons encode the same amino acid. Also, regions of the enzyme that are less vital to function (non-active sites, for instance) might not experience the same degree of selective pressure as more important regions. Although there is still conservation of protein sequence in these regions, it may not be as stringent as in others.

Background

To clone a known gene from an uncharacterized organism, PCR primers (synthetic single-stranded oligonucleotides 17-25 bases long) must be designed that are complementary to conserved regions of the GAPDH gene. Even conserved regions are not identical between organisms, however. A best guess of the gene sequence is made using a comparison alignment of GAPDH gene sequences from different related organisms, with the understanding that the primers will not be an exact match to the sequence and may amplify non-specific sections of DNA in addition to the target sequence.

A second set of primers has been designed (interior to the first set of primers) and is used to amplify the PCR products from the first round of PCR. This is called nested PCR and is based on the extremely slim chance of non-specifically amplified DNA also encoding these interior sequences. In other words, if the wrong fragment is amplified with the first primers, the probability is quite low that the wrong fragment will be amplified during the second round of PCR. As a result, the PCR products generated from nested PCR are very specific. Since the nested PCR primers are interior within the first fragment, the PCR products generated during the second round of PCR are shorter than the first ones. See diagram below for an illustration of nested PCR.

The strategy for this experiment uses nested PCR to amplify portions of the GAPDH gene we want to study. In the initial PCR reaction, a set of degenerate primers is used. These are primers that contain more than one DNA nucleotide base at specific positions, increasing the likelihood that the primer will bind. Then in the nested PCR reaction a more specific set of primers will amplify GAPDH. The initial PCR reaction uses Primer Set 1, while the second round of PCR, the nested PCR reaction, uses Primer Set 2. It is very important not to reverse the order of the primers or to mix the two primer sets together.

Arabidopsis genomic DNA has been included as a control for these PCR reactions. In addition, a plasmid encoding the initial PCR product amplified from Arabidopsis genomic DNA acts as a second control.
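To make the idea of a degenerate primer concrete, the following minimal sketch expands a primer written with standard IUPAC ambiguity codes into every concrete sequence it represents. The example primer is hypothetical and is not one of the primers supplied with the kit; the function names are likewise illustrative.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """Return every concrete sequence that a degenerate primer stands for."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


def degeneracy(primer):
    """Count the distinct sequences present in the degenerate primer mix."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


# Hypothetical degenerate primer (assumption for illustration only).
example_primer = "GGNAARTAYGA"
variants = expand_degenerate(example_primer)
print(degeneracy(example_primer), "sequences, e.g.", variants[:3])
```

A primer with one four-fold position (N) and two two-fold positions (R, Y) is a mixture of 4 x 2 x 2 = 16 sequences, which is why a degenerate primer is more likely to find a match in an uncharacterized genome but is also less specific.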
As each PCR reaction takes approximately 3-4 hours to run, it is more practical to run the PCR reactions on separate days. Since the reagents used in these experiments function best when prepared fresh, it is highly recommended that the reagents be prepared just prior to setting up the PCR reactions.

Step 2A- Initial PCR Reaction Protocol
Procedure can be obtained at explorer.bio-rad.com

Stage 2B: Nested GAPDH PCR

Background

In this next experiment, PCR products generated in the previous stage will be further amplified (i.e., they serve as the template) in a second round of PCR. However, before performing the nested PCR, the primers that were not incorporated into the PCR product must be removed so that they do not amplify target DNA in the second round of PCR. To do this, an enzyme that specifically digests single-stranded DNA, Exonuclease I, is added to the PCR reactions. In nature, this enzyme is involved with proofreading and editing newly-synthesized DNA. Exonuclease I needs to be inactivated before it is introduced into fresh PCR reactions containing Primer Set 2, to prevent it from digesting those new primers.

Following Exonuclease I treatment, the PCR products generated in the first round of PCR are amplified using the nested primers. Plasmid DNA will also be amplified in this step to serve as a positive control for PCR. A no-template negative control is also run.

Step 2B- Nested PCR (Second Round) Protocol
Procedure can be obtained at explorer.bio-rad.com

Stage 3: Analysis of PCR Products by Agarose Gel Electrophoresis

Background

To assess PCR success, the products of both the initial and nested PCR reactions are analyzed by agarose gel electrophoresis. The procedure for this can be found on-line at www.bio-rad.com

Analyzing results from the PCR

Students examine and interpret their gels. An example is given below.
A+ = Arabidopsis genomic, positive control; P+ = plasmid, positive control; ─ = water, negative control; Plants 1-6. I = initial PCR; N = nested PCR.

The portion of the GAPDH gene that has been targeted for PCR varies in size between plant species. The expected size of the fragment from the initial round of PCR is between 0.8 and 2.5 kb. The fragment from the second round, the nested PCR, is expected to be slightly smaller than the PCR product from the initial round. It is possible that some plants may amplify doublets - two DNA fragments of similar sizes. This is probably due to amplification of two GAPDH genes that are very homologous (genes that share similar structures and functions that were separated by an ancient duplication event). Students construct a table of their results, addressing issues like number and relative intensity of the bands.

We recommend that a laboratory class of 16 students attempt to work with only one or two plant species. Cloning the same gene multiple times will provide significant depth of coverage. This will help resolve any ambiguities when the gene sequence is prepared for eventual publication in GenBank. Remember, the ultimate goal of this laboratory is to provide new data for the scientific community at large, thus it is vital that the data be as correct as possible.

It is recommended that the plant species chosen for this project be the one that generates the cleanest PCR product (fewest background bands) with strong band intensity of an appropriate size. It is acceptable to clone doublets since each plasmid is expected to ligate a single DNA fragment. Be aware, however, that this may mean that two different gene sequences are obtained.

Stage 4: Purification of PCR Products

The next step, after generating DNA fragments, is to find a way to maintain and sequence these products.
This is done by ligating (inserting) the fragments into a plasmid vector (small circular pieces of double-stranded DNA found naturally in bacteria) that can be propagated in bacteria. To increase the success of ligation, it is necessary to remove unincorporated primers, nucleotides and enzymes from the PCR reaction. This is done by using size-exclusion column chromatography. In this method, small molecules (like proteins, primers and nucleotides) get trapped inside the chromatography beads, while large molecules (like DNA fragments) are too large to enter the beads and thus pass through the column into the micro-centrifuge tube.

Background

Following PCR, the amplified product needs to be purified using spin columns supplied in this kit. The purpose of the purification step is to remove unincorporated dNTPs, Taq Polymerase, primers and small primer-dimers so that the DNA can be successfully digested with a restriction enzyme and ligated to a vector.

This procedure uses small spin-columns that fit into microcentrifuge tubes. These columns contain a matrix composed of minuscule beads having numerous microscopic pores. These porous beads significantly increase the surface area of the matrix. Any solutes that are added to the matrix will diffuse in and out of the pores very readily, but only if the solutes are small enough to enter the spaces. The porous spaces occurring in this matrix are designed to be large enough to hold onto smaller molecules (like dNTPs and primer-dimers), but too small to hold onto larger molecules (like PCR product). In this scenario, the PCR product that we want to clean will not be detained by the porous matrix because it is hundreds of DNA bases in length.
When the PCR product is applied to the top of the columns, the molecules that we want to exclude (like dNTPs, Taq Polymerase, primers) will tend to be captured by the microscopic pores and the larger molecules (our DNA product) will be pushed through the matrix much more readily. DNA strands between the sizes of 32 bp and 200 bp should be eliminated, thus yielding a very pure GAPDH PCR product. Without this cleaning step we could be less successful in the next steps of the cloning process. The opportunity also exists here to run gel electrophoresis of samples of PCR product before and after cleaning to demonstrate the efficacy of the spin-column cleaning.

At this step it is helpful to know the DNA concentrations of both the clean, digested PCR product and the vector. These can be determined using a fluorometer, a spectrophotometer or a DNA-dye assay. These concentration values are used in determining the amounts of DNA used in the subsequent ligation reaction.

Step 4- Purification of PCR Product Protocol
Procedure can be obtained at explorer.bio-rad.com

Stage 5: Cloning - Ligation and Transformation

Background - Ligation

In this stage the PCR product will be inserted (ligated) into a plasmid vector. The plasmid is supplied and has already been opened up to receive the fragment. However, prior to ligating the fragment into the plasmid, the PCR fragment must first be treated to remove the single adenosine nucleotide that is left on the 3´ ends of the PCR fragment by Taq DNA polymerase. This is performed by a proofreading DNA polymerase (an enzyme with a 3´ proofreading exonuclease domain that allows the polymerase to remove mistakes in the DNA strands). This polymerase functions at 70°C, but not at lower temperatures, so it is not necessary to inactivate this enzyme after use.

Once blunted, the PCR fragment is combined with the plasmid.
A T4 DNA ligase (an enzyme that catalyzes the formation of phosphodiester bonds between the 5´-phosphorylated PCR fragment and the 3´-hydroxylated blunt plasmid) is added and the ligation reaction is completed in 5-10 minutes.

Background - Transformation

During ligation many different products are produced. In addition to the desired ligation product where the PCR fragment is inserted into the plasmid vector, the vector may re-ligate, the PCR product may ligate with itself, etc. Relatively few of the DNA molecules formed during ligation are the desired combination of insert + plasmid vector. To separate the desired plasmid from other ligation products, and also to have a way to propagate the plasmid, the ligation reaction is transformed into bacteria. Plasmid vectors are natural bacterial plasmids that have been genetically modified to make them useful to molecular biology researchers.

In order to get a plasmid into bacteria, the bacteria must be made competent. To make bacteria competent they must be actively growing, ice cold and suspended in transformation buffer that makes them porous and more likely to allow entry by plasmids. In this stage, bacteria are grown in culture media so that they are actively growing. Then they are pelleted, cooled and resuspended in transformation buffer two times to ensure they are competent. It is vital to keep bacteria on ice at all times. Bacteria are then mixed with the ligation reaction and plated on warm LB Ampicillin agar plates that will only permit bacteria expressing ampicillin-resistance genes (encoded by the pJet1.2 plasmid) to grow. These plates also contain isopropyl β-D-1-thiogalactopyranoside (IPTG), which induces expression of the ampicillin-resistance gene. Plates are then incubated at 37°C overnight. To confirm that cells were made competent by this procedure, a control plasmid is also transformed.
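The DNA concentrations measured after purification feed into the ligation setup. As a hedged illustration, the sketch below uses the common rule of thumb that equal numbers of molecules require masses in proportion to fragment length, so the insert mass for a chosen insert:vector molar ratio is vector mass x (insert length / vector length) x ratio. The 3:1 ratio, the 3,000 bp vector length, and all other numbers are assumptions for illustration; the kit protocol should be followed for actual amounts.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, molar_ratio=3.0):
    """Mass of insert (ng) giving the requested insert:vector molar ratio."""
    if min(vector_ng, vector_bp, insert_bp, molar_ratio) <= 0:
        raise ValueError("all arguments must be positive")
    return vector_ng * (insert_bp / vector_bp) * molar_ratio


def volume_ul(mass_ng, concentration_ng_per_ul):
    """Volume (µL) of a DNA stock that contains the requested mass."""
    if concentration_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return mass_ng / concentration_ng_per_ul


# Illustrative values only (assumptions, not taken from the kit manual):
# 50 ng of a 3,000 bp vector, a 1,000 bp nested PCR product,
# and a purified insert measured at 25 ng/µL.
needed_ng = insert_mass_ng(vector_ng=50, vector_bp=3000, insert_bp=1000)
needed_ul = volume_ul(needed_ng, concentration_ng_per_ul=25)
print(f"insert needed: {needed_ng:.1f} ng = {needed_ul:.1f} µL")
```

Because the nested PCR product varies in size between plant species, the insert length measured on the gel changes the mass required, which is one reason the concentration measurement is worth making.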
Step 5- Preparation of Competent Cells, Ligation and Transformation of PCR Product into Plasmid Protocol
Procedure can be obtained at explorer.bio-rad.com

Analysis of results of ligation and transformation

Students count the number of bacterial colonies growing on both their control and transformation plates.

Stage 6: Isolation of Plasmid DNA

Background

It is necessary to analyze the plasmids that have been successfully transformed to verify that they have the PCR fragment inserted. To do this, a sufficient amount of plasmid DNA is obtained by growing a small culture of bacteria, purifying the plasmid from the bacteria, and performing restriction enzyme digests. Restriction enzymes cut double-stranded DNA at specific recognition sequences on the plasmids. The plasmid used to ligate the PCR products is pJet1.2, which contains Bgl II restriction enzyme recognition sites on both sides of the insertion region (figure below). Thus, once the plasmid DNA has been isolated, a restriction digestion reaction is performed to determine the size of the insert.

Step 6- Purify and Analyze Plasmid Mini-preps Protocol
Procedure can be obtained at explorer.bio-rad.com

Analyzing results from the restriction digest

Students are expected to interpret their gels and prepare samples for sequencing. An example of a gel is given below. Each lane represents a different clone that was restriction digested.

Stage 7: Set up Sequencing Reactions

To study this newly-cloned GAPDH gene the recombinant plasmids need to be sequenced. Like PCR, sequencing reactions rely on the basic principles of DNA replication and as such require primers to initiate the replication. However, sequencing is performed in just one direction, and so instead of a primer pair, sequencing makes use of single oligonucleotides. Each sequencing reaction proceeds in a single direction, so two sequencing reactions are set up for each plasmid - one forward and one in reverse. In this lab, plasmid DNA will be combined with sequencing primers (primers that anneal to the target DNA at a known location on a specific vector DNA strand) and then mailed to a sequencing facility, which will perform the sequencing reactions and then send the DNA sequences back for analysis.

Background

The technique for determining the exact order of As, Ts, Cs and Gs in cloned DNA is the Dideoxy Method. This method, also called the Sanger Method, is named for Dr. Fred Sanger of Cambridge, England, who invented it in the mid-1970’s (Sanger and Coulson, 1975). In this approach, the plasmid clones are heat denatured (to separate the complementary DNA strands with high temperature), and used as template to synthesize new strands of DNA. To achieve this, the template is incubated with commercially available DNA Polymerase, sequencing primers, deoxynucleotide triphosphates (dATP, dTTP, dCTP, dGTP), and relatively small amounts of dideoxynucleotide triphosphates (ddATP, ddTTP, ddCTP, ddGTP). The difference between the deoxy- form and the dideoxy- form of nucleotide triphosphate is a missing -OH group on the 3´ carbon of the deoxyribose. This hydroxyl group is necessary for normal DNA synthesis, so if a growing chain of DNA happens to incorporate a ddNTP instead of a dNTP, the DNA synthesis reaction is stopped. This termination event occurs rarely enough that all possible lengths of DNA get synthesized during the process. For instance, a sequencing reaction that provides the DNA sequence for 800 bases would essentially involve synthesizing 800 different strands of DNA, covering the entire range of possible lengths. These different lengths of DNA are resolved by electrophoresis (frequently capillary gel electrophoresis) and visualized.
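The logic of chain termination can be sketched in a few lines. The toy model below is an idealized illustration, not a simulation of real reaction chemistry: it assumes exactly one terminated strand of every possible length, and the template sequence and function names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Strand synthesized from a template, both written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def chain_termination_fragments(new_strand):
    """One terminated strand per position; each ends in a labeled ddNTP."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_sequence(fragments):
    """Sort shortest first (as electrophoresis does), read each end label."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))


# Hypothetical 10-base template (assumption for illustration only).
template = "ATGGCTCCTA"
new_strand = reverse_complement(template)
fragments = chain_termination_fragments(new_strand)

# The reaction tube holds the fragments in no particular order.
random.Random(0).shuffle(fragments)

print(len(fragments), "fragments ->", read_sequence(fragments))
```

Each fragment length occurs once, so a 10-base read needs 10 distinct strands, in the same way that an 800-base read involves about 800. Separating by size and reading the terminal base of each fragment is enough to recover the synthesized sequence.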
A common visualization method is to ‘end label’ each of the four types of dideoxynucleotide triphosphate with a different fluorescent dye that can be distinguished with a digital camera during electrophoresis. DNA sequence output is based on the fact that longer lengths of DNA move more slowly during electrophoresis than shorter lengths, and that the digital camera can detect the color of the fluorescent dye that labels each of the bands. Since the specific color of the dye attached to each of the different ddNTPs is known, and since that specific ddNTP will end-label the growing DNA strand only where its complement occurs on the plasmid template, the task of correlating the order of colors with a specific sequence of DNA is relatively straightforward. This method detects and records the dye fluorescence and shows the output as fluorescent peaks on a chromatogram.

The primers used for DNA sequencing are different from the primers used to amplify the GAPDH gene via PCR. Sequencing primers are designed to complement the DNA sequence of the cloning vector, rather than the insert DNA. If the primers used for PCR were also used for sequencing, then part of the clone’s sequence would be missing because readable sequence starts about 20-50 bases away from the primer itself. This is a function of the size of DNA polymerase. Most commercially available cloning vectors are designed to have sites that are relatively far from the multi-cloning region and that will bind to widely available sequencing primers. These universal sequencing primers allow researchers to work with different cloning vectors at the same time.

However, the current method used in DNA sequencing only generates ~500 bp of usable DNA bases per read, whereas the GAPDH gene in some species is much larger.
Therefore, internal sequencing primers are used to primer walk (primers designed to obtain a contiguous sequence from an internal region of the gene of interest). In our case, double-stranded primer walking (providing contiguous sequence information from both DNA strands) will be performed.

The U.S. Department of Energy’s Joint Genome Institute in Walnut Creek, CA has a Sequencing Training Program that offers free DNA sequencing for classroom projects such as this. More information is available at http://www.jgi.doe.gov/education/stp.html

Step 7- Set up Sequencing Reactions

Protocol
Procedure can be obtained at explorer.bio-rad.com

Sequencing Reaction Plan

Plate Barcode Identifier____________________

Potential file names after conversion from 96-well plate to 384-well plate

96-well row:       A      B      C      D      E      F      G      H
384-well rows:     (A,B)  (C,D)  (E,F)  (G,H)  (I,J)  (K,L)  (M,N)  (O,P)

96-well column:    1      2      3      4      5       6        7        8        9        10       11       12
384-well columns:  (1,2)  (3,4)  (5,6)  (7,8)  (9,10)  (11,12)  (13,14)  (15,16)  (17,18)  (19,20)  (21,22)  (23,24)

Bioinformatic Analysis of GAPDH Sequences
In this section, you will employ some of the features of iFinch to review your experimental results and obtain a preliminary identification for your gene. iFinch contains both software and a relational database, and both are involved in data management and analysis. People work with iFinch through the web: the web pages collect information, the information is analyzed by software in iFinch, and it is stored in a relational database. Other software programs pull information out of that database, analyze it, and present it in tables and reports.
Cloning the GAPDH gene
• Extract genomic DNA from plants
• Amplify region of GAPDH gene using PCR
• Assess the results of PCR
• Purify the PCR product
• Ligate PCR product into a plasmid vector
• Transform bacteria with the plasmid
• Select and grow bacteria containing the plasmid
• Isolate plasmid from the bacteria
• Sequence DNA either through the Joint Genome Institute (JGI) or from a local DNA sequencing service
• Perform bioinformatics analyses with the sequence data

The following types of analyses will be performed:
1. Assess the quality of the data and view sequence traces (Geospiza’s FinchTV™)
2. Identify the cloned GAPDH sequences
3. Assemble sequences into a contig (CAP3 program)
4. Use your sequences to BLAST a nucleotide database (NCBI GenBank)
5. Identify introns and exons (adding annotations)
6. Translate the predicted mRNA sequence (in six reading frames) and check with BLASTx
7. Translate the mRNA sequence to predict the sequence for your protein
8. Phylogenetic analysis (NCBI)

Section 1. Review the Quality of the Data and View Sequence Traces

Extracting Sequences and Quality Assessment
One of the first iFinch programs extracts the quality values and base sequence from the chromatogram file. In DNA sequencing, fluorescently labeled molecules of DNA are separated by size through capillary electrophoresis. As the DNA moves through the capillary, it passes in front of a detector that measures the intensity of fluorescence. Software in the sequencing instrument processes that signal and identifies each base. The chromatogram file that is produced by the sequencing instrument also includes information about the presence of other signals, their relative positions, and their intensity. When this information is presented in a graph, it’s called a trace. At the top of the trace are the base calls.
Each letter represents a base, with unidentified bases shown as N’s. For each base, the height and shape of the peak correspond to the signal intensity, and the spacing shows the relative times when the signals were measured.

Many DNA sequencing instruments contain additional software that can evaluate the information from the signal intensity, timing, and whether or not peaks overlap. This software provides additional information about the quality of each base. A base is considered high quality when the identity of the base is unambiguous. Other features used in quality measurement are the evenness of the spacing between bases and the consistency of the signal strength. A high-quality region of sequence has evenly spaced peaks that do not overlap and a signal intensity in the proper range for the detection software.

In general, most base-calling programs define quality scores in the same way. The quality value rises as the probability that a base has been misidentified falls, so a higher “quality value” means that we can be more confident that the base is correct. If the “quality value” is low, we are less likely to accept the base call. The equation for quality values is: Q = -10 log10(P_error), where P_error is the probability that the base call is wrong. If the chance of a mistake were 1 in 100, for example, P_error would be 0.01, and the “quality value” would be 20. The image below, from Geospiza’s FinchTV program, uses a blue line to mark the location of quality values equal to 20.

Quality Trimming
The sequence of bases from each chromatogram is called a read. Once the read and quality information have been extracted from the chromatogram or determined by the base caller, other analytical programs can get to work. Since the data at the 5´ and 3´ ends of reads are often poor quality, a standard step in DNA sequencing is quality trimming.
In this process, a program examines the quality of each base at the ends of each read. When a large enough fraction of the bases have quality scores above a certain threshold, usually 20, the trimming program marks that position at each end of the read and measures the length between trim points. Later, when we download the DNA sequence, we can elect to have those portions trimmed or hidden. In FinchTV, the trimmed regions are indicated by a grey shadow (shown below). In the iFinch chromatogram report (below), the “quality value” for each base is shown in a graph. Trimmed bases in the sequence have a strikethrough.

Vector Screening and Masking
One other analytical program is important for our experiment. This program serves three functions: vector identification, vector masking, and sequence screening. The sequence identification steps involve comparing the read sequence to a database of DNA sequences, obtaining a score, and calculating the percent that match. In one step, the read is compared to a database of sequences from cloning vectors. In the other, the read is compared to a set of reference genomic sequences from the seven known Arabidopsis GAPDH genes. These analyses allow us to determine which parts of a read are similar to GAPDH and whether any parts of the read came from the cloning vector. They also allow us to quickly evaluate our results and identify which genes were cloned and in what proportions.

Vector masking is an optional step that can take place when data are downloaded from iFinch. If this option is selected, each base in the vector region is replaced by a special letter that will hide it from other DNA analysis programs. On the previous page, a program in iFinch colored the vector sequences blue and sequences that matched one of the GAPC genes red. When we view traces from iFinch in FinchTV, vector sequences are shaded pink.
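The quality-value equation and the trimming idea from Section 1 can be sketched in a few lines. The conversion functions follow the equation given in the text. The trimming rule is only an assumption for illustration (a sliding window of 10 bases that must contain at least 8 bases at Q20 or better); it is not a description of the algorithm iFinch actually uses, and the function and parameter names are hypothetical.

```python
import math


def phred_quality(p_error):
    """Q = -10 * log10(P_error), as given in the text."""
    return -10 * math.log10(p_error)


def error_probability(quality):
    """Invert the quality equation to recover the error probability."""
    return 10 ** (-quality / 10)


def trim_points(quals, threshold=20, window=10, min_good=8):
    """Return (start, end) slice indices of the region to keep, or None.

    A window is 'good' when at least min_good of its bases have a
    quality value at or above the threshold. The kept region runs from
    the first good window to the end of the last good window.
    """
    good_starts = [
        i for i in range(len(quals) - window + 1)
        if sum(q >= threshold for q in quals[i:i + window]) >= min_good
    ]
    if not good_starts:
        return None
    return good_starts[0], good_starts[-1] + window


# A toy read: poor quality at both ends, good quality in the middle.
quals = [5] * 12 + [40] * 30 + [8] * 10
start, end = trim_points(quals)
print(phred_quality(0.01), start, end, end - start)
```

Because the rule tolerates a couple of low-quality bases per window, the kept region can extend slightly into the poor-quality ends; a stricter `min_good` would trim more aggressively.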
Relational databases and SQL
Because data are stored in a relational database, we have the ability to ask novel questions about our data. When chromatogram files are loaded into iFinch, the data are stored in two ways. First, they are stored in their original form. If you download chromatograms from iFinch, they will be identical in every way to the chromatogram files that were uploaded. Second, the information is extracted and stored in an organized fashion in tables inside the database. A relational database consists of several tables, based on a model of the data and the relationships between different types of data (a read, for example, is a DNA sequence that’s linked to a chromatogram). Each table represents a different data type. We might have a table with the kinds of data that are found in chromatograms and another table with the kinds of data that are found in reads.

Section 2. Assembling the Sequence and Correcting Mistakes with FinchTV

Assembly Programs and Why We Assemble DNA Sequences
In nature, DNA molecules are found in a variety of sizes, many of which are quite large. Even chromosome 21, the smallest human chromosome, is 47 million nucleotides in length. The smallest Arabidopsis chromosome is 18.5 million nucleotides long. DNA sequencing technology, however, rarely produces sequences longer than 900 bases. Consequently, we can only find the sequence of a longer piece of DNA by reconstructing it from smaller pieces. This process is called assembly or sequence assembly.

You will assemble the reads from your clones with a program called “CAP3.” Many assembly programs, including “CAP3”, work by comparing all the sequences to each other, calculating a score for each pair of sequences, and then merging the sequence pairs together, working from the highest scoring pair to the lowest scoring pair, until all possible pairs have been merged.
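The compare, score, and merge cycle described above can be sketched as a toy greedy assembler. This is not how “CAP3” is implemented: the sketch assumes error-free reads that all come from the same strand, scores a pair simply by the length of its exact suffix-to-prefix overlap, and ignores quality values. The function names and the example reads are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are treated as no overlap (0).
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the best overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        # Score every ordered pair and remember the highest-scoring one.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # Nothing left that overlaps; stop merging.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTAC", "GTACCTTGA", "TTGACCGTA"]
print(greedy_assemble(reads))
```

Real assemblers also try the reverse complement of each read and tolerate mismatches, which is why their pairwise scores come from alignments rather than exact string matches.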
The contiguous sequences that result from merging shorter sequences are called contigs. A diagram of a contig is shown below.

Some assembly programs like “CAP3” and “Phrap” can also use quality information, when available, to help guide the assembly process. If there are positions where the sequences disagree, these programs use the higher quality base as the choice for the contig.

In genome sequencing, the next step is called finishing. Finishing is a process where researchers examine the contigs to look for misassemblies or regions that require additional coverage. That information may be used to identify errors, edit sequences and reassemble, or to synthesize new primers and generate additional sequences to cover gaps and put contigs together.

One of the greatest problems in sequence assembly results from repetitive sequences. Repetitive sequences, also known as repeats, occur at multiple positions in the genome and are closely related to one another. These sequences complicate the assembly process because the assembly programs find high scoring alignments between repetitive sequences and assemble them, even when they were obtained from different parts of the genome. Finding errors that occurred from assembling the wrong repeats together is still a challenge in genomic sequencing. Fortunately, Arabidopsis has few repetitive sequences. Describing RepeatMasker is beyond the scope of this manual, but if you are interested in screening your clones for repetitive sequences, you may use the RepeatMasker server at the Institute for Systems Biology (www.repeatmasker.org).

Interpreting the Assembly Results
You will save two of the files from your assembly results, the contig sequence and the assembly details.
The “assembly details” file shows the contigs that were put together during the assembly, lists the reads that were used in constructing each contig, and shows the positions of each read relative to the contig sequence. Your “assembly results” will most likely show only one contig since all of your clones should map to the same region of DNA. When we view our “assembly details” file, we see this same information in greater detail.

A portion of an “assembly details” file is shown below with an alignment between three reads and the consensus sequence. The consensus sequence (at the bottom) is the same as the contig sequence. If quality values were supplied to the assembly program, the resulting sequence is a mosaic, put together from the highest quality bases in each of the three reads. (In our case, the “CAP3” assembly program that we’re using doesn’t have a place for entering quality values.)

By convention, the consensus sequence is shown in a 5´ to 3´ orientation. Above the consensus sequence are the reads. If a read is in the same orientation as the consensus sequence, a + sign appears after the read name. If the sequence of a read came from the other strand, “CAP3” determines the complementary sequence and prints it in the reverse direction. We call this sequence the reverse complement. Presenting the reverse complement of a sequence is helpful because it makes it easier for us to spot positions where the reads disagree. These disagreements are called discrepancies. In the image above, there’s a substitution discrepancy where two reads identify a base as an “A” and the other read contains a “C.”

Another kind of discrepancy is called an indel. The word “indel” refers to either an insertion or a deletion. Indels are important because they can change the reading frame and make it harder for us to predict the right protein sequence. The example below contains an indel.
The read from QCDP869377.b2_A01.ab contains an “A” and this base is missing (deleted) from QCDP869381.b2_I01.ab.

Which read is correct? We can answer this question by viewing the trace in FinchTV. In this part of the project, we will use FinchTV to review the assembly results and investigate discrepancies between reads. If we find a mistake in our contig sequence, we will edit our reads and reassemble.

Section 3. BLAST Searches

Introduction to BLASTn
BLAST stands for Basic Local Alignment Search Tool (Altschul et al., 1990). The BLAST family of programs is designed to find short (local) regions where pairs of sequences match. Members of the BLAST family work in similar ways, but for now we’ll limit our discussion to BLASTn.

BLASTn is used to compare a nucleotide sequence to a database of nucleotide sequences. To do this comparison, BLASTn breaks the query sequence (the sequence you’re searching with) into ‘words’ of a defined length. Then BLASTn compares each word to a database of words, derived from a set of nucleotide sequences. If all the letters in the words match perfectly, BLASTn looks at each end of the word pair to see if the matching region might be extended and tries to make the longest matching region that it can.

At the end of the search, BLASTn counts all the nucleotides in the matching region and awards two points for every pair of bases that match. If one sequence has an insertion, a deletion, or a gap (more than one base is missing) relative to the other, BLAST takes points away from the score. The net result is that a BLASTn score is approximately equal to two times the length of the matching region.

The BLAST score is one of the primary statistics used to evaluate a match. The other main statistic is the E value.
The “E value” represents the number of sequence matches that you would expect to find (where a different database sequence matches your query just as well as the subject sequence matched your query) if you were to search a database of random sequences. When E values are well below 1, they are approximately equal to the probability of finding such a match by chance. This would mean that if we have an “E value” of 0.01, then there is about a 1% chance that we would find an equally good match in a database of random sequences.

Often E values are very low. In fact, if we have a perfect match, the “E value” might be given as zero. It should be noted that even though these values are shown as zero, they are never really zero. They’re just presented as zero because the number of characters in the exponent might be too large to fit nicely in a table.

While low E values are good, high E values tell us that it might be possible to find an equally good match by random chance. Two additional factors have a strong influence on E values. These are the length of the sequence and the size of the database. This is because it’s easier to find a perfect match to a shorter sequence. It’s also easier to find a match by chance in a larger database.

Understanding BLASTn Results
The results from a BLASTn search include many different kinds of information and statistics. These bits of information include the size of the database, the length of each query sequence, statistics that describe the number and percent of matching bases, a BLAST score, and the E value.

The sequences in this example (shown below) come from a GAPDH cloning experiment with a plant from the genus Salvia (the common name is sage). If you wish to use these reads and repeat this test, they are in the “salvia1” folder in the GAPDH_test_data project and should be available in your iFinch.
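To see how database size and match score influence the “E value” discussed in the introduction to BLASTn, the sketch below uses the standard Karlin-Altschul relation E = m × n × 2^(-S'), where m is the query length, n is the total database length, and S' is the normalized (bit) score. That formula is not given in this manual, and real BLAST searches apply corrections such as effective lengths, so the numbers are illustrative only. The function names and the database sizes are assumptions.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance matches scoring at least bit_score.

    Simplified relation E = m * n * 2**(-S'), ignoring the
    effective-length corrections that BLAST applies.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of seeing at least one chance match: 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


# The same hit looks less significant in a larger (hypothetical) database.
small_db = expect_value(bit_score=50, query_len=766, db_len=1_000_000)
large_db = expect_value(bit_score=50, query_len=766, db_len=100_000_000)
print(small_db, large_db)

# For small E values, the probability is nearly the same as E itself.
print(chance_probability(0.01))
```

A database 100 times larger raises the E value 100-fold for the same score, while each additional bit of score halves it.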
The first piece of information we see from BLASTn is a graph that shows a map of the query sequence on top (in this case, a Salvia sequence), with colored boxes below to show where the database sequences match and how well they match. A key at the top of the graph shows how the different colors correspond to the BLASTn scores. Below the key is a thick red bar depicting our query sequence. In this case, the query was 766 bases long.

Below the query are different colored lines that represent matching sequences from the database. The top line has six colored blocks (of varying length and color) that are attached by thinner lines. Each block corresponds to a part of the database sequence that matched the query. It looks like the subject sequence has the best match to the query between nucleotides 160 and 440 (of the query).

The next information from BLASTn is a table that summarizes the statistics. Each row contains a matching sequence, with the best matching sequence at the top of the table. Notice, in this table, the information in the Description column tells us that the matching sequences come from specific chromosomes, with the Accession number linked to the matching chromosome. We’ll find a more precise location for our subject sequences and which GAP genes are matched in just a moment.

As we read across the table towards the right, the next column we come to is the Max score column. Each of the colored blocks in the BLAST alignment graph above has been assigned a score based on the goodness of the match. The “Max score” comes from the block of aligned sequence that had the highest score. If we remember that the BLASTn score is about twice the number of matching nucleotides, we can estimate that the maximum score for the top sequence represents either 113 matching bases or a longer region that contains gaps.

The next value is the Total score.
The “Total score” is obtained by adding the scores from each matching region. In our case, the “Total score” is not very helpful because it represents the total from all the matching regions on a single chromosome. Since both chromosomes 1 and 3 contain multiple copies of the GAPDH genes, this score isn’t informative.

In the next column, the Query coverage corresponds to the fraction of the entire query sequence that’s matched by parts of the subject.

Next, we have the E value. In the top row, the “E value” is 6 x 10^-58 (the “e” in the table stands for “exponent”). This means that there is roughly a 6 in 10^58 chance of finding this match in a database of random sequences. A match like this is not likely to occur by random chance, which is good. In other words, the lower the “E value” the better.

Moving farther right, the Max ident column shows the block of sequence that has the highest percentage of matching bases. In this case, the “maximum identity” of any matching block is 100%; however, when we scroll down and examine the matching regions in more detail, we find that the only region where 100% of the bases match is only 18 nucleotides long. In this experiment, this statistic is not terribly useful.

The last column contains links to other databases that are identified in a key above the table. In the case of our results, the last column contains links to GEO. “GEO” is the Gene Expression Omnibus database. This database stores information from microarray experiments.

To view our alignments, we can either scroll down the page, click the link in the “Max score” column, or click a subject sequence in the alignment graph. The sequence alignments are organized by subject sequence, with all the matches from one subject sequence grouped together.
The sets of alignments are presented in the order of the maximum score, with the first set containing the longest and best alignment. This can be seen in the alignments below.

At the top of this set, we see the name of the subject sequence. It appears that the best match to this query sequence is GAPC. We can confirm this by looking at the sequence location in the Arabidopsis genome. The numbers that are shown at the beginning and end of each alignment correspond to specific nucleotide positions in the sequences. For the subject sequence, these values are locations in a specific chromosome. We can see from the map below that the sequence coordinates from the first alignment, 1,082,841 and 1,082,587, place this alignment within the GAPC gene on chromosome 3. We can conclude that the identity of this query sequence is GAPC.

In other cases, the results can be more ambiguous. In the image below, we see BLASTn results from one of the other Salvia query sequences. The best matching regions of the subject sequences (GAPC and GAPC-2) match this query almost equally well. In this case, we have to decide which sequence is the best match by calculating the total score for all the alignments between the subject sequences (within a specific gene) and our query.

To Analyze Your BLASTn Results
In this section, you will examine the two genes that your query sequence matches best and calculate a “Total score” for the match to each gene (if there’s a match at all). For each query sequence, examine the graph and the coordinates of the matching subject sequences (shown in the alignment) to determine which two genes match your query sequence the best. To increase the certainty of your identification, you will need to determine the “Total score” for the match between the gene and your query sequence.
To make this easier, there are tables provided in Exercise 3.2 that you can use for analyzing your results. Complete the information in two tables (one for each of the two best matching genes) for each of your query sequences. The tables show the four GAPC genes (GAPC, GAPC-2, GAPCP-1, and GAPCP-2), their chromosomal locations, and the coordinates. For each query, use the two tables that correspond to the two best matches. Record the beginning and ending positions for each alignment, and the score for that alignment. When you are through entering the information for each region that aligns within that gene, calculate a total score for that gene by adding the scores for each region.

Example alignment data from a BLAST search with a Salvia query sequence (QCDP869377) and an example of one completed table are shown below.

Example alignments:
For Query= QCDP869377.b2_A01.ab1 Chromat_id=2140 Length=766 Type=Folder Name=KSalvia1 Id=47 Length=766

Alignments to GAPC:

Alignments to GAPC-2

Example tables:
Query= QCDP869377.b2_A01.ab1

Gene     Chromosome  Begin      End        Query begin  Query end  Score for query sequence match
GAPC     3           1,082,841  1,082,587  183          437        226
                     1,082,344  1,082,244  666          766        87.8
                     1,083,030  1,082,920  6            116        75.2
Total score                                                        389

Gene     Chromosome  Begin      End        Query begin  Query end  Score for query sequence match
GAPC-2   1           4,609,088  4,609,193  334          439        111
                     4,608,855  4,609,004  187          336        86
                     4,609,431  4,609,531  666          766        75.2
Total score                                                        272.2

i. Use the total score that’s calculated from the table to identify the gene that your sequence matches best. Repeat this for each of your query sequences.
ii. The result from the best match gives you the identity of your gene.
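The per-gene totals in the example tables can be checked with a short script. The alignment coordinates and scores are copied from the two example tables for query QCDP869377.b2_A01.ab1; the data layout and the function name are just one possible way to organize them, not part of iFinch or BLAST.

```python
# Each alignment: (subject begin, subject end, query begin, query end, score)
alignments = {
    "GAPC": [
        (1_082_841, 1_082_587, 183, 437, 226.0),
        (1_082_344, 1_082_244, 666, 766, 87.8),
        (1_083_030, 1_082_920, 6, 116, 75.2),
    ],
    "GAPC-2": [
        (4_609_088, 4_609_193, 334, 439, 111.0),
        (4_608_855, 4_609_004, 187, 336, 86.0),
        (4_609_431, 4_609_531, 666, 766, 75.2),
    ],
}


def total_score(gene_alignments):
    """Add the scores of every region that aligns within one gene."""
    return sum(score for *_coords, score in gene_alignments)


totals = {gene: round(total_score(rows), 1) for gene, rows in alignments.items()}
best_gene = max(totals, key=totals.get)
print(totals, best_gene)
```

The gene with the larger total is taken as the most likely identity of the query, which is the same comparison the exercise asks you to make by hand.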
Conclusion from this example:
Our analysis shows that the identity of the sequence from chromatogram QCDP869377.b2_A01.ab1 is most likely to be GAPC. This identification is supported by the total BLASTn score of 389, which is higher than that of the next best match (GAPC-2), with a score of 272.

Section 4. Adding Annotations: Finding the Intron-Exon Boundaries, Putting the mRNA Together and Checking Prediction with BLASTx

Annotating
Most eukaryotic genes contain internal sequences that do not code for protein. During gene expression, RNA splicing removes these sequences, known as introns, and fuses the sequences that contain coding information (exons) together. This process is shown below.

Our goal in this stage of the project is to construct a gene model that shows where the exons are likely to be located within our contig and to predict the likely amino acid sequence for the encoded protein. The process of identifying the coding sequences and adding information to our work is called annotating the sequence. The extra bits of information are called annotations.

Predicting splice sites in genomic DNA is an active area of research. We can build gene models by aligning mRNA sequences to genomic DNA, but we do not understand enough about splicing signals yet to predict splice sites accurately by computation alone.

Ideally, we would identify our splice sites by aligning our genomic contig sequence to a complete mRNA sequence from the same gene, from the same species of plant. Since we are looking at new genes, however, those sequences are probably not available. Luckily, we have an alternative. We can use the reference mRNA sequences from other plants. These are mRNA sequences that have been reviewed by NCBI staff and characterized.
We will begin this stage by aligning our contig sequence to the reference mRNAs with BLASTn. Then we will use Microsoft Word® or another text editor to construct our gene model by putting the sequence of our predicted mRNA together. We will test the gene model with BLASTn and make corrections as needed. These steps will be repeated until we’re satisfied that the model is correct. Finally, we will test the predicted protein sequence to determine if the splice sites were identified correctly.

How to Interpret the Results and Predicting the Exon Positions
The BLASTn results will show the name of your contig and its length in nucleotides towards the top of the page. Below will be a graph showing where portions of the reference mRNA sequences align to your contig. The ends of each alignment correspond approximately to the ends of each exon.

The graph shown here was obtained by using the “salvia1” contig after reviewing the assembly and editing the contig sequence to correct any errors. From the graph and the table, the second mRNA, from the Arabidopsis thaliana GAPC gene, has the longest and most extensive match. The first gene turns out to be an unidentified genomic sequence from rice. We can predict from this graph that the exons on our contig are approximately located between nucleotides 0 to 100, 225 to 400, 500 to 550, 700 to 800, and 1260 to 1300.

Now we will look for more precise exon/intron boundaries and mark the positions in the contig. This will involve reformatting your BLAST results, preparing your contig sequence, and working with the sequence data.

Section 5. Studying the Evolution of GAPDH

Phylogenetics
Phylogenetics is the study of the evolutionary relationships among organisms. In the past, both anatomical and fossil records provided evidence on how closely or distantly related organisms are.
However, with the advent of bioinformatics, molecular evidence (DNA and protein sequences) has been used to compare homologous sequences and to build phylogenies based on these sequence comparisons.

The evolutionary relationship between taxonomic groups can be represented graphically using an evolutionary tree, also called a phylogenetic tree. Similar to a family tree that illustrates descendants, family relationships, and successive generations, phylogenetic trees illustrate the evolutionary relationships between taxonomic groups. This tree is based on the concept that there are common ancestors. Evolutionary time is represented by the branch length.

Exercise 1.1 Examine Traces with FinchTV and Evaluating Data Quality.

1. Log into the practice iFinch account (http://www.geospiza.com/education/products/biorad.html) with the publicly available user name and password.
User name: “BR_guest”
Password: “guest”
Computer ID: ____________

2. Find the folder with your class data by clicking the linked value or by clicking Folder from the Chromats menu.
Your Folder Name ________________________________

3. A page will appear with a list of folders. Click your folder label to see the data from your plant.

4. A page will appear that presents some of the data from the folder. Each row contains data from a single chromatogram. Each column shows a different type of data. iFinch tables also include tools for working with the data. You may:
• sort the data by clicking any of the column headings that appear in blue
• filter or view data with certain characteristics by using the finder at the top of the table
• view the data as an Excel® spreadsheet by clicking the Excel® icon
• view a trace in FinchTV by clicking the FinchTV icon

5. Find the column labeled “Q20”.
These data correspond to the number of bases in each read that have a “quality value” of 20 or greater.
6. Click the column to sort the data so that the lower-quality data (low numbers of Q20 bases) appear at the top of the table.
7. Click one of the chromatogram labels to view the Chromatogram Read report. Notice the graph of the quality values. The dotted line marks where the “quality value” is equal to 20.
Do you see many quality values above 20? ______
How many quality values are above 20? ______
How many quality values are below 20? ______
8. Open the first file by double-clicking on one of the FinchTV icons (located to the left of the chromatogram labels). What can you say about the quality of the bases at the beginning and end of the read? (You may wish to click the wrap button in FinchTV to scroll through the trace view more easily.) Give the base numbers for the low-quality regions for all 4 files in your folder:
1) ________________________
2) ________________________
3) ________________________
4) ________________________
9. You can also see the quality values for different bases. Click a base that you think is low quality and look at the bottom left corner of the FinchTV window. You can see the quality value for any base and the position where it appears within the sequence. Repeat this action with a high-quality base. Did the quality values match your expectations? Explain your answer.

Exercise 1.2 Examine Folder Reports and Statistics

Now that you know what Q20 values represent and understand data quality for individual bases and reads, let’s look at your data.
1. Select the “Folders” link in the Chromats menu.
2. Click the label of your folder.
3. Click the “Folder report” link at the top of the table.
The Folder report presents a visual summary of the data in the folder.
The Q > 20 histogram presents the number of reads that contain varying numbers of good-quality bases. These types of graphs are helpful when you wish to compare sequencing results from different data sets.

Exercise 2.1 Assemble Your Sequences

In this portion of the project, you will assemble the FASTA files that were downloaded from iFinch and stored in your folder. To do the assembly, you will use the “CAP3” assembly server at the University of Lyon in France (http://pbil.univ-lyon1.fr/cap3.php). After the assembly, you will need to review the assembly details file and determine if there are any discrepancies between the reads. If you see discrepancies, you will need to review the trace files in FinchTV, edit the traces if necessary, and reassemble the edited reads.
1. Get your file with the FASTA-formatted read sequences from your iFinch folder. To do this, click the “Folders” link in the Chromats menu.
2. Locate and click the link to your iFinch folder.
3. Notice the two tabs at the top of the table. Within each folder, files are organized by whether or not they’re chromatograms. If you click the Chromatogram tab, you will see a list of your chromatograms. If the Other tab is selected, this page shows a list of all the other files.
Let’s start assembling your contig:
4. Open up your folder. Click on “Download Folder Data”. On the Download page click on the “Export Sequences” button. If you have problems, save the file to the desktop first and then open it using Word®. You might have to browse through different programs to find Word®.
5. Open the file in a text-editing program (Windows). We are actually not going to save this!
6. We are going to copy-and-paste everything on this page, so highlight it and copy.
7. Go to the iFinch home page and click the “Sequence Assembly” link to access the University of Lyon CAP3 web service (http://pbil.univ-lyon1.fr/cap3.php).
8. Paste the copied text into the text box in the assembly web form.
9. Click the “Submit” button to start the assembly.
10. When the assembly is complete, a page will appear with links to your results. These links are:
a. Contigs
b. Single sequences
c. Assembly details
d. Your sequence file
We’re going to look at each of these in turn. Use the back button on the browser to return to the assembly results page after viewing each page.
11. Click “Your Sequence File” first to make certain that you pasted the correct information into the form.
12. Next, click “Single Sequences.” Sequences appear here if they could not be used in the assembly.
13. Click “Contigs.” You will most likely see one, or possibly two, contig sequences. Save this page as a text file (.txt). The name of this file should include last-name, plant-name, and the phrase “contig-fasta.”
Write that name here: _________________________________
14. Ideally, all your read sequences should be assembled into a single contig. If you end up with multiple contigs, it’s possible that the sequences you assembled didn’t really belong together. Other explanations could be poor-quality data or mistakes by the assembly program. If you do have multiple contigs, pick one for further analysis.
15. Click “Assembly Details.” Save this page as a .txt file, with your name, plant-name and the word “assembly.” If you saved this as a Word® file (.doc), be sure to change the left and right margins to 0.4 inches each.
Write that name here: __________________________________

Exercise 2.2 Store Your Multi-Sequence FASTA Files in iFinch

Before going forward, it will be helpful to store your downloaded contig FASTA file in iFinch. This way, it will be ready later for sequence assembly.
1. To upload your assembly results, return to iFinch and click the “Upload” link in the “Chromats” menu.
2. The Data “Upload” page will appear.
3. Choose “Generic” for the file type.
4. Click the “Browse” button to find your file on your computer.
5. Click the spyglass to locate your folder.
6. Click the “Submit” button to upload your file.
7. Confirm that the uploading options you chose were correct. The message should say “Generic file”, containing your file name (your names, plant-name and the word “assembly”).
Note: If the wrong file type or the wrong folder was chosen, you will not be able to find your data in your folder later and you will need to upload the files again.

Exercise 2.3 Review the Assembly with FinchTV

1. Find your contig and assembly details file in your iFinch folder.
2. Click the filenames to download the files to your computer, if they aren’t already there.
3. Open the “assembly details” file with any text-editing program or Microsoft Word®.
4. Print this file so that you can mark discrepant positions with a highlighter for review.
5. Open up your iFinch folder.
6. Begin editing by finding and reviewing indels. To find indels in your assembly details, search your assembly file for “_” characters. Note positions where indels are located.
7. Review the sequences by eye to identify other positions where the bases disagree. Note the positions of these discrepancies.
8. To review a read, find the sequence in your folder and click the “FinchTV” icon to open the read in FinchTV.
9. Examine the assembly detail file to determine how the read is oriented relative to the consensus sequence.
• If the assembly details show a “+” after the name of your read, then your read sequence will be the same sequence that is shown in FinchTV.
• If there’s a “-” sign after the name of your read, you will need to obtain the reverse complement for your sequence. If that’s the case, click the FinchTV reverse complement button to get the reverse complement of the read.
10. Now, copy a portion of the sequence that appears on the 5′ side of the discrepancy as shown in the example below.
11. Next, paste your copied sequence in the FinchTV “Find Sequence” window.
12. Press the return (or enter) key to find the sequence and highlight it as shown below. We can now examine the problem base. Looking at the trace, it’s pretty clear that one of the reads contained an error and the indel should have been an “A.” This should be changed. If you can’t decide which read is correct, just let them remain the way they are. Don’t try to force these sequences to match. We have other clones from this species to compare them to.
13. If other reads align in a questionable region, check those traces too, and confirm your results. To do this, find the other read that aligns at that position and shows the deletion, and check the base in the same way. You can have several reads showing on the computer at the same time to make this job easier.
When we look at the other read, we can see that it does have an “a” at the discrepant position but it was missed by the base caller. So we can be confident that the “A” is correct.

Exercise 2.4 Edit Reads with FinchTV

In the previous example, our consensus sequence was correct even though one of the reads contained a mistake. What should we do if the consensus sequence is mistaken? If the consensus sequence is wrong, we need to correct the mistake and reassemble the reads. We can correct the mistake by editing the read sequence in FinchTV.
1. Find the read that you need to edit.
2. Open the trace in FinchTV and locate the questionable region of sequence.
3. To change a base, click the base to highlight it and type a new base.
4. To insert a base, click the base on the 3′ side of the insertion point. Select “Insert before base” from the edit menu and enter the base.
5. To delete a base, click the base and press the delete key on your keyboard.
Save your edits back to the iFinch database. This process creates a record of any changes that are made to the read sequence. Any edits can be reviewed by selecting the “View Revision History” link found on the “Chromatogram Read” page.
6. Continue reviewing and editing all the discrepancies in your assembly until you are satisfied that each read shows the correct sequence. Exit the file when done; this will prompt the computer to save the changes for you.

Exercise 2.5 Reassemble Edited Reads

If you found cases where the correct sequence was different from the consensus, and you edited the read file in FinchTV, you can obtain a corrected contig sequence by carrying out another assembly. Note: your new assembly results will show edited bases as lower case letters. Follow the same procedure as in Exercise 2.1 (University of Lyon’s website), and save your files back to your folder in iFinch. Give this file a different name that includes “contig-edited”, your name, and the plant name.

Review Questions
1. What is a contig?
2. Why is it important to have multiple reads of the area of a gene when assembling a contig?
3. What does it mean if a read has a “-” sign after the name?

Exercise 3.1 Download FASTA Formatted Sequences and Use BLASTn to Verify the Identity

As previously mentioned, distantly related sequences might not match the Arabidopsis sequences that we’ve put into iFinch for screening.
To ensure that you’re finding everything you can, download the FASTA sequences from your high-quality clones and compare them to the GAPDH sequences in GenBank. The same thing can be done with the contig assembly. A FASTA-formatted sequence begins like this:
>sequence_name
GAGAGAGATAGATAGATAGATAGAGAGCCCCGCGAG
The sequence begins with a “>” symbol, then the sequence name without spaces, a new line, and the DNA sequence in text without paragraph formatting. If your contig does not begin with a “>” symbol, arrange it so it does. Our objective here is to verify the identity of our contig and to examine its similarity to published sequences.
1. Click the “Geospiza Finch” in the top left corner to return to the iFinch home page.
2. Click the link to “NCBI BLAST”. In the BLAST page, look under the BLAST heading and choose “nucleotide BLAST”.
3. Click the “Browse” button on the form to locate your contig sequence file and upload it (see example below).
4. Next, open up the database pull-down list and select “Reference genomic sequences (refseq_genomic)”. We need to use the genomic database because these subject sequences will include genomic coordinates, where others will not. We need those coordinates in order to distinguish between the different members of the Arabidopsis GAPDH gene family and determine which gene is most similar to ours. If we use other databases, we will not be able to distinguish between matches to the different members of the GAPDH gene family.
5. Click the button to choose “BLASTn” as the program. BLASTn is more sensitive than the other programs and allows you to find “somewhat similar sequences”.
Note: BLASTn differs from megablast and discontiguous megablast by using a smaller word size. As a consequence, BLASTn is more sensitive and can find more distantly related sequences. Of course, there’s a price for sensitivity: BLASTn is also more likely to find matches that occur by random chance.
6. Type “plants” in the Organism text box. As you type, different selections from the Taxonomy database will appear. Choose “plants” from the list.
7. Click the blue “BLAST” button. After a few moments, the results will appear. Be sure to read through Sections 4.2 and 4.3 below before interpreting your results.
Note: If you sign up for an NCBI account (top right corner) and log in, NCBI will store your BLAST results for 36 hours. You can view your results by logging in and selecting the “Recent Results” tab.

Exercise 3.2 Calculating the Best Match for Your Contig

Run your contig in BLASTn as done previously. Fill in whichever table(s) below is appropriate. Don’t worry if the Arabidopsis sequence is in the “-” orientation (where the beginning value is higher than the ending value).

Gene       Arabidopsis chromosome    Begin           End
GAPC       3                         > 1,080,957     < 1,083,357
GAPC-2     1                         > 4,608,193     < 4,610,644
GAPCP-1    1                         > 29,920,795    < 29,924,127
GAPCP-2    1                         > 5,574,304     < 5,577,616

For each gene that matches your contig, record the following:
Query begin ________   Query end ________   Score ________
Total score ________

Review Questions
1. Does an “E value” of zero mean that your sequence matched the subject sequence well or poorly? Explain your answer.
2. What would it mean if you found a subject sequence with an “E value” of 3?
3. Why did we search the reference genomic database?

Exercise 4.1 Use BLASTn to Align the Contig to Reference mRNA Sequences

1. Locate your contig sequence that should now be on your desktop.
2. Go to the iFinch home page and click the “NCBI BLAST” link.
3. Select “nucleotide BLAST” from the BLAST home page.
4. Click the “Browse” button to locate your contig sequence on your desktop and upload it for BLASTing.
5. Select “Reference mRNA sequence (refseq_rna)” for the database.
6. Type “plants” in the Organism box.
7. Select “BLASTn” (“Somewhat similar sequences”) as the program.
8. Click the “Algorithm parameters” link at the bottom of the BLAST page.
9. Change “Max target sequences” from “100” to “10”. This change will make the results easier to interpret because fewer sequences will be shown.
10. Change the “Word size” to “7”. This will increase the sensitivity of the BLASTn search and allow us to detect more distantly related sequences and short exons.
11. Click “BLAST” and your results should appear after a few moments.

Exercise 4.2 Reformat Your BLASTn Results

1. First, you will need to change the formatting of your BLASTn results to make the exon boundaries easier to spot. Select “Reformat these Results” from the top of the page.
2. The “Format Request” page will appear.
3. There are a number of settings that you will change on this page.
a. Change the “Show” setting to “Plain Text.”
b. Change the “Alignment View” to “Query-anchored with dots for identities.”
c. Change all of the values in the “Limit results” row to “10”. This change will make the results easier to read.
4. Click the “View report” button to see your reformatted BLASTn results.
5. Save your results as a .txt file. Since you may be working with this file for some time, you may find it helpful to upload these results and store them in your iFinch folder for convenience. Print one copy of these results to hand in. Keep this file open for now.

Exercise 4.3 Find the Exon-Intron Boundaries

1. Open the contig.txt file that is on your desktop and copy-and-paste all of the text into a blank Microsoft Word® document. Change your left and right page margins to 0.3 inches.
2. Insert a blank space between the contig title and the DNA sequence, but make sure your sequence is still in a FASTA format (i.e. begins with “>”). Keep this file open.
3. A query-anchored alignment shows your query sequence (in this case, your contig) at the top. The alignments to all other sequences are shown below with identical bases shown as dots. For this step, we will find and work with each exon in turn. For each exon, we will locate the beginning of the exon in the contig sequence and mark it (in the Word® document) by making the text a different color. Then we will find the 3′ end of the exon sequence and mark where it’s located in the contig. Last, we will color all the text in between the 5′ and 3′ ends of the exon. When we are through, each exon will be shown in a different color than the rest of the text, making it easier to pick out those regions of sequence. We are using an example below from an experiment to clone GAPC from Salvia. Your results will be different.
4. You should have two files open now: the BLAST alignment (.txt) and the contig you saved in Word®. An example of the BLAST alignment is shown directly below. You can see that the first exon begins at position 6 in our contig and reads AGCC.
5. Find the first AGCC in our contig sequence (this is in the Word® file), beginning at position 6, and color it red to mark the beginning of the exon.
6. Now use the BLAST alignment to find the 3′ end of the exon. Scan the BLAST results until the aligned region appears to end. In this example, there are five possible locations where the first exon might end.
Since the second sequence is from Arabidopsis, and Arabidopsis is probably the best-characterized plant genome, we’ll guess for now that this is the end point for our exon. Our later results will help us determine if this is correct or not.
7. Now you will need to find the location of the 3′ end in your contig. To do this, it’s helpful to use your mouse to select and copy a section of the query sequence (about 20-25 bases) from the BLAST alignment that precedes the 3′ end (shown below).
8. Use that copied sequence and the “Find” command to search your Word® document and locate this sequence in your contig. You can use Control-V to paste this sequence into the “Find” box.
9. When you find the sequence, change the color of the text to mark it.
10. Now select all the bases between the beginning and end of the exon in the contig.doc file, and change the color for those bases too.
11. Repeat this process for the remaining exons, marking the 5′ end, then the 3′ end, and then coloring the bases in between. Be sure to mark the different exons with different colors of text. Save this altered file.
12. When you are through, your contig sequence will look something like this (except in color):
13. It may be helpful to store your sequences in your iFinch folder between class sessions. Make certain that you keep your sequence in a FASTA format.

Exercise 4.4 Check Your Proposed mRNA with BLASTn

Now, it’s time to check your work by doing a BLASTn search with your proposed mRNA sequence.
1. In the contig.doc file, create the proposed mRNA sequence by copying-and-pasting the colored regions only, putting them together in the same order to create a model for your mRNA. Do this on the same page as the original contig sequence. When you are done, this second sequence should not have any black letters.
Make sure that your sequence is in a FASTA format. It should look something like this:
2. Go to NCBI and select “nucleotide BLAST”.
3. Copy your mRNA sequence from your Word® document and paste it into the BLASTn search box.
4. Choose the “Reference mRNA sequences” as your database.
5. Type “plant” for the Organism, and select the “plants” option.
6. Choose BLASTn as the program.
7. Click “Algorithm parameters” and choose “7” for the Word size. Click “BLAST”.

Exercise 4.5 Interpreting Results

The results from our example mRNA sequence are shown below. When you did a BLASTn search before, your query sequence contained segments (introns) that wouldn’t be found in mRNAs. Since this time you used a possible mRNA segment as a query, your BLAST results should not show gaps. We can see that our proposed mRNA matches several subject sequences along the entire length without any gaps. This is because the mRNA sequence should be in a single reading frame from beginning to end. Your results may be similar, or you may find breaks in the sequence where a portion of sequence is missing or differs between plant species. Ultimately, the DNA sequence of all the clones from a single species will be aligned with each other to further check for accuracy. Next, you will need to examine your results in further detail.
1. Reformat the BLAST results, as described earlier in Exercise 4.2, so that BLAST results show a query-anchored alignment from ten subject sequences.
2. Scroll through the aligned sequences and look for possible discrepancies. The query sequence appears to contain four more bases than the reference mRNAs. Dashes indicate missing bases. These discrepancies could be indels that come from sequencing errors or from improperly joined exons. They could also be due to real differences between plant species.
3. Print out these results.
4. One way to check these discrepancies is to compare the sequence of your GAPDH clone with that of other GAPDH clones from the same plant. We can do this since several of your classmates worked with the same species. Email the instructor your final, corrected contig so that it can be posted. Later you will compare the contigs to each other (Exercise 6.1).

Review Questions
1. What kinds of sequences will we find in a genomic sequence: exons, introns, or both? _______
2. What kinds of sequences will we find in an mRNA: exons, introns, or both? _______

Exercise 4.6 Translating the Predicted mRNA Sequence and Checking with BLASTx

Now we can predict the amino acid sequence of your contig. BLASTx translates nucleotide sequences in all six possible reading frames (three for each strand). It then compares each of these putative amino acid sequences to a database of protein sequences published in GenBank. This is helpful since we don’t know which reading frame is the correct one.
1. Go to the “NCBI BLAST” home page.
2. Select “BLASTx”.
3. Copy the sequence of your final mRNA model and paste it in the query box.
4. Choose “Non-redundant protein sequences (nr)” for the database.
5. Choose “plants” as the organism.
6. Click “BLAST”.
7. If your mRNA model is correct, you should see the following results:
• The alignment should span the entire length of your query sequence, from 0 to the end.
• The entire sequence should be in a single reading frame, and there should not be any gaps in the subject lines. If the top BLAST hit line is not continuous, then the submitted contig should be re-evaluated, because it still contains one or more base insertions.

Review Questions
1. BLASTx translates a nucleotide sequence in six reading frames and then uses each one to query a protein database. Why are there six possible reading frames?
2. In the BLASTx results the letters are no longer limited to A, G, C, or T. What do letters in the BLASTx results represent?

Exercise 4.7 Translate Your mRNA Sequence to Obtain the Predicted Sequence for Your Protein

1. Find the sequence of your final mRNA in your folder and click the file name to download and open the file.
2. Make sure that the mRNA is still in a FASTA format.
3. Select and copy the mRNA sequence.
4. Return to the iFinch home page.
5. Select “Sequence Utilities” from the home page.
6. A page from the “BCM Search Launcher” will appear. The “BCM Search Launcher” is operated by the Baylor College of Medicine in Texas. There are several programs at this site that you can use for manipulating and working with DNA sequences.
7. Paste your mRNA sequence in the text box. Make sure the sequence is preceded by “>”.
8. Select “6 Frame Translation” from the choices in the list. This will translate your mRNA sequence in all six reading frames.
What is your correct reading frame? Determine this by comparing the output with the BLASTx results in the previous section. Also, remember that an open reading frame contains no internal stop codons. _________
9. Copy the amino acid sequence that you think is the correct translation and use it as a query with BLASTp (BLASTp uses a protein sequence to search a protein database) from the NCBI BLAST site. Set the search parameter to provide 250 max target sequences. If you’ve picked the correct sequence, it should match the correct protein. This output can be used in the next exercise on phylogenetics.

Review Questions
1. Compare the results from the BLASTx search with the results from the BLASTp search. How do the matches compare? What accounts for discrepancies?
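
The six-frame translation used in Exercises 4.6 and 4.7 can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by BLASTx or the BCM Search Launcher: the function names, the frame labels (+1 to +3 and -1 to -3), and the toy sequence are assumptions made for the example. It uses the standard genetic code, with “*” marking a stop codon and “X” marking any codon that contains a character other than A, C, G, or T.

```python
# Standard genetic code (NCBI translation table 1), built from the
# conventional T, C, A, G ordering of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta_sequence(text):
    """Drop the '>' title line and join the remaining sequence lines."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return "".join(line for line in lines if not line.startswith(">"))


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset` (0, 1, or 2)."""
    dna = dna.upper()
    protein = []
    for pos in range(offset, len(dna) - 2, 3):
        # Codons with unexpected characters (for example N) become 'X'.
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


def six_frame_translation(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the GAPDH data.
    example = ">toy_sequence\nATGGCTTAA\n"
    for name, protein in six_frame_translation(read_fasta_sequence(example)).items():
        print(name, protein)
```

For a real mRNA model, the frame to look for is the one whose translation runs the full length of the sequence without an internal “*”, which is the same criterion described in step 8 of Exercise 4.7.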
Exercise 5.1 Phylogenetics

On the BLASTp results page, click the link called “Distance Tree of Results”. Each accession listed in this output is connected by a line to other accessions. The shorter the line, the more similar the sequences. Common ancestors are indicated where two or more lines converge. Therefore, this tree shows evolutionary relationships. The gray bullets in the output are mostly for predicted or unknown proteins, while the colored bullets are better characterized. The species can be established by placing the cursor over the colored bullets.
1. List three plant species that have sequences that are most closely related to your gene. To what category do these closely related plant species belong?
2. List three plant species that have sequences that are least related to your gene. To what category do these distantly related plant species belong?

Exercise 6.1 GAPDH Sequence Alignment

1. Go to the Dolan DNA Learning Center BioServers website (www.bioservers.org).
2. In the left-hand column, click “Sequence Server.” (If you want, you can register and save your work.)
3. To create GAPDH sequences for comparison:
A. Click the “Create Sequence” button.
B. Copy the final contig sequence of your clone and paste it into the Sequence window. Enter a name for the sequence (like “Joe Tomato”), and click “OK”. When it has been saved you can click the “Cancel” button to see it listed.
C. Repeat with all of the other final contigs for that same species of plant.
D. Check the boxes on the left of the sequences you want to compare.
E. Click the “Compare” button. Your sequences are sent to Cold Spring Harbor Laboratory, NY, where they will be aligned using CLUSTALW.
F. A new window will appear with your results.
G. To view the entire gene, change the parameter so that all of the bases show on the page. Click “Redraw.” Yellow highlighting denotes disagreements between sequences.
H. Print results, and mark discrepancies with a highlighter pen.
4. Are there discrepancies between the different clones? If so, what type of differences do you see?
5. Why do you think there might be discrepancies between these different clones for the GAPDH gene from the same plant? Give at least three reasonable explanations.

Notes for Instructor

Detailed instructions for media preparation, setting up PCR reactions, etc. are provided at explorer.bio-rad.com. See Appendices for examples of questions, discussion ideas, and for supplementary ways to analyze results.

Materials

In addition to the Bio-Rad kit, the following equipment and supplies are needed for a lab section of 16 students (2 students per group; 8 groups per lab section):

Equipment                                                        Quantity
Thermal cycler                                                   1
Micropipettors (0.5-2 µl; 2-20 µl; 20-200 µl; 200-1000 µl)       8
37°C water bath or heating block                                 1
37°C incubator                                                   1
Gel electrophoresis apparatus (32 wells)                         1
Electrophoresis power supply                                     1
Gel documentation system                                         1
Microcentrifuge                                                  1
Microwave                                                        1
Computers                                                        8
-20°C freezer                                                    1
Balance                                                          1

Supplies                                                         Quantity
Filtered pipette tips (all sizes)                                4 boxes
Tube racks                                                       4
Cold block or chipped ice                                        4

Optional: Gels containing EtBr can be purchased from Bio-Rad.

Safety precautions and disposal: Consult with your campus Chemical Safety Division for guidelines for disposal of gels stained with ethidium bromide (EtBr) or SYBR Green. Recombinant bacteria should be autoclaved or bleached before disposal.

Acknowledgements

Special thanks to Bellarmine University students Kathryne Blair, Catherine Brumm, Stephanie Kortyka, Stephanie Mitchell, Melissa Pawley, Emily Whitledge and Sanda Zolj for their hours spent performing this laboratory exercise and for their helpful suggestions.
Thanks to all our past Molecular Biology students at Bellarmine University who have been instrumental in helping us develop this exercise. The collaboration of the Joint Genome Institute (Department of Energy) in Walnut Creek, CA is much appreciated. Literature Cited Altenberg, B., and Greulich, KO. 2005. Genes of glycolysis are ubiquitously overexpressed in 24 cancer classes. Genomics 84: 1014-1020. Altschul, SF, Gish, W., Miller, W., Myers, EW and Lipman, DJ. 1990. Basic local alignment search tool. Journal of Molecular Biology 215: 403-410. Benson, D.A., I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. 2007. GenBank. Nucleic Acids Research, 35: D21-D25. Dennis, D.T. and S.D. Blakely. 2000. Carbohydrate Metabolism (Chapter 13). In: Biochemistry and Molecular Biology of Plants (Eds. Buchanan, B.B., W. Gruissem, R.L. Jones), American Society of Plant Physiologists, Rockville, MD. Figge, R.M., Schubert, M., Brinkmann, H., and Cerff, R. 1999. Glyceraldehyde-3-phosphate dehydrogenase gene diversity in eubacteria and eukaryotes: Evidence for intra- and inter-kingdom gene transfer. Molecular Biology Evolution 16: 429-440. Kim, JW. and Dang, CV. 2005. Multifaceted role of glycolytic enzymes. Trends in Biochemical Science 30: 142-150. Olsen, K.W., D. Moras, and M.G. Rossmann. 1975. Sequence variability and structure of dglyceraldehyde-3-phosphate dehydrogenase. Journal of Biological Chemistry 250: 9313-9321. Sanger, F., and A.R. Coulson. 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology 25: 441-448. Modular Cloning and Sequencing: A Six-Week Project in Molecular Biology and Bioinformatics Sirover, M.A. 1999. New insights into an old protein: the functional diversity of mammalian glyceraldehyde-3-phosphate dehydrogenase. Biochimica et Biophysica Acta 1432: 159-184. Tatton, W.G., R.M. Chalmers-Redman, M. Elstner, W. Leesch, F.B. Jagodzinski, D.P. Stupak, M.M. Sugrue, and N.A. Tatton. 
2000. Glyceraldehyde-3-phosphate dehydrogenase in neurodegeneration and apoptosis signaling. Journal of Neural Transmission, Supplementa, 60: 77-100.

Appendix A

Example of results from initial and nested PCR with student questions/discussion topics.

G+ = genomic DNA (Arabidopsis), positive control
P+ = plasmid DNA (Arabidopsis GAPDH), positive control
W- = water, negative control
I = initial PCR
N = nested PCR

Questions:
1) Why did we include positive controls in this experiment?
2) Why did we include a negative control in this experiment?

Questions 3-7 pertain to the positive control using genomic DNA (Arabidopsis):
3) How many bands are present after the initial PCR?
4) What are the approximate sizes of the different bands produced in the initial PCR?
5) Why are there multiple bands present after the initial PCR?
6) How many bands are present after the nested PCR?
7) Briefly explain the differences seen in the initial vs. the nested PCR reactions.
8) Describe the sizes and intensities of the bands produced in your initial vs. nested PCR reactions.
9) Explain the differences between your initial and nested PCR reactions.

Appendix B

Example of results from a restriction enzyme digest of recombinant plasmid with student questions.

MW ladder sizes: 10, 8, 6, 5, 4, 3, 2.5, 2, 1.5, 1, and 0.5 kb

Questions:
1) Describe the function of the Restriction Enzyme (R.E.) used in this experiment.
2) Using semi-log paper, predict the sizes of the bands present in your lanes. How do they compare with those of other students who are working with the same plant species?

Notes to instructor. Additional analysis for class discussion:
1) Explain the results seen in Lane 4.
2) Do you think that the clone examined in Lane 7 is for GAPDH?
3) Do you see any lanes that show more than 2 bands?
4) What are some possible explanations for 3 or more bands?

Appendix C

Example of a dot-plot for pair-wise comparison of two sequences. In this example, the two sequences being aligned are the phrases “KEY LIME PIE” and “BIG MICKEY”. In the grid below, an “X” was placed wherever the top and side phrases contained the same letter. The longest diagonal formed by contiguous “Xs” indicates the region of greatest identity, in this example “KEY”. This is a simplified example of how a longer contig can be predicted from shorter sequences that contain overlapping regions.

       K E Y   L I M E   P I E
   B   . . .   . . . .   . . .
   I   . . .   . X . .   . X .
   G   . . .   . . . .   . . .

   M   . . .   . . X .   . . .
   I   . . .   . X . .   . X .
   C   . . .   . . . .   . . .
   K   X . .   . . . .   . . .
   E   . X .   . . . X   . . X
   Y   . . X   . . . .   . . .

The final sequence (contig) in this example would be “BIG MICKEY LIME PIE”.

Appendix D

Partial example of the output for a contig assembled with the “CAP3” assembly server at the University of Lyon.

QCDP866765.b2_I19.ab+  AAATATTAATATTGACAAGTATTACGAGCATGAAAAACATAATACAAGAATGGGTGCAAA
QCDP866758.b2_K17.ab-  AAATATTAATATTGACAAGTATTACGAGCATGAAAAACATAATACAAGAATGGGTGCAAA
QCDP866715.b2_E07.ab-                                                    TGGGTGCAAA
                       ____________________________________________________________
consensus              AAATATTAATATTGACAAGTATTACGAGCATGAAAAACATAATACAAGAATGGGTGCAAA

QCDP866765.b2_I19.ab+  CTTACTGCAAAGCATATCACAAAAAAACAAACTGAGACAGCTTAAATTAAATGCTTAAGG
QCDP866758.b2_K17.ab-  CTTACTGCAAAGCATATCACAAAAAAACAAACTGAGACAGCTTAAATTAAATGCTTAAGG
QCDP866715.b2_E07.ab-  CT-ACTGCAA-GCATATCACAAAAAA-CAAACTGAGACAGCTAAATTAAATGCTTAAGG
                       ____________________________________________________________
consensus              CTTACTGCAAAGCATATCACAAAAAAACAAACTGAGACAGCTTAAATTAAATGCTTAAGG

QCDP866765.b2_I19.ab+  GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC
QCDP866758.b2_K17.ab-  GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC
QCDP866715.b2_E07.ab-  GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC
                       ____________________________________________________________
consensus              GGGTGCCATGTCCACGCACTGTTTTACCAAAGAATGAGAAAAGGTAACAGACAAATGGAC

QCDP866765.b2_I19.ab+  ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA
QCDP866758.b2_K17.ab-  ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA
QCDP866715.b2_E07.ab-  ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA
                       ____________________________________________________________
consensus              ATGTAGCAATTACAGCATGAATACCTTGGCAGCACCAGTGCTGCTGGGAATGATGTTGAA

QCDP866765.b2_I19.ab+  GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCCATCACAGTTTTCTGGGT
QCDP866758.b2_K17.ab-  GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCATCAACAGTTTTCTGGGT
QCDP866715.b2_E07.ab-  GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCATCAACAGTTTTCTGGGT
QCDP866751.b2_M15.ab+                                             TCAACAGTTTTCTGGGT
                       ____________________________________________________________
consensus              GCTAGCAGCCCTTCCACCTCTCCAGTCCTTGGAAGAGGGACCATCAACAGTTTTCTGGGT

QCDP866765.b2_I19.ab+  AGCTGAAAAAAAAAATGC
QCDP866758.b2_K17.ab-  AGCTGAAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT
QCDP866715.b2_E07.ab-  AGCTGAAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT
QCDP866751.b2_M15.ab+  AGCTGAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT
                       ____________________________________________________________
consensus              AGCTGAAAAAAAAAATGCAAAATCCCATGTAAATAAGCATAGCCTTGCATTAAAGTACTT

QCDP866758.b2_K17.ab-  AT-ACACACCATG
QCDP866715.b2_E07.ab-  ATTACACACCATGTTGTTTTAAGTCAACAAAATCATCAAATACCAGTGATAGAGTGCACG
QCDP866751.b2_M15.ab+  ATTACACACCATGTTGTTTTAAGTCAACAAAATCATCAAATACCAGTGATAGAGTGCACG
                       ____________________________________________________________
consensus              ATTACACACCATGTTGTTTTAAGTCAACAAAATCATCAAATACCAGTGATAGAGTGCACG

Appendix E

An additional method to confirm introns and exons is the “NetPlantGene Server”, provided by the Technical University of Denmark. The “NetPlantGene Server” is an artificial neural network designed to predict intron splice sites in Arabidopsis (Hebsgaard et al., 1996)*. It is freely available on the web.

1. Go to the “NetPlantGene Server” at http://www.cbs.dtu.dk/services/NetPGene/
2. Paste your final contig sequence into the sequence box.
3. Hit “Submit Sequence”.
4. Below is a partial example of an output. It shows the locations of the predicted intron/exon splice sites and the relative confidence levels for those predictions.

Length: 1365 nucleotides.
24.1% A, 20.1% C, 19.9% G, 36.0% T, 0.0% X, 39.9% G+C

Donor splice sites, direct strand
---------------------------------
pos 5'->3'   phase   strand   confidence   5' exon intron 3'
       293       0        +         1.00   CCTTGCCAAG^GTAATTCTTG  H
       470       1        +         0.96   TCTATCACTG^GTATTTGATG
       669       0        +         1.00   TGCTGCCAAG^GTATTCATGC  H
      1036       2        +         1.00   CTGCCATCAA^GTGAGTTATC  H
      1204       2        +         0.96   GTGACAGCAG^GTACCTTCAC

Acceptor splice sites, direct strand
------------------------------------
pos 5'->3'   phase   strand   confidence   5' intron exon 3'
       145       0        +         0.96   TTTGAAATAG^GGCGGTGCCA
       408       0        +         0.87   ATATTTGCAG^GTCATTAATG
       570       1        +         0.96   TTTTTTTCAG^CTACCCAGAA
       734       0        +         0.78   GGTAAAACAG^TGCGTGGACA
       892       0        +         1.00   CTCCTTGCAG^GCTGTTGGAA  H
      1119       2        +         1.00   GCCTTTTCAG^GGCCGAGTCT  H
      1336                +         0.00   TTAAAAACAG^GTCTAGCATC

*Hebsgaard, S.M., P.G. Korning, N. Tolstrup, J. Engelbrecht, P. Rouze, and S. Brunak. 1996.
Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Research 24: 3439-3452.

Appendix F

Intron/exon locations can be examined further by studying genomic GAPDH sequences that have already been published in GenBank. The mRNA feature for these accessions lists the base-pair locations of each exon. Exon positions are often similar in homologous sequences. A partial rice sequence is shown below, but you would use the GenBank sequence that is most similar to the GAPDH sequence you are studying (as revealed by BLAST). In this example, exons were underlined in the printed sequence; their coordinates are given by the mRNA join(...) feature.

LOCUS       NC_008397               3602 bp    DNA     linear   PLN 19-FEB-2008
DEFINITION  Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 4.
ACCESSION   NC_008397 REGION: 24280691..24284292
VERSION     NC_008397.1  GI:115461545
SOURCE      Oryza sativa Japonica Group
  ORGANISM  Oryza sativa Japonica Group
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
            clade; Ehrhartoideae; Oryzeae; Oryza.
  AUTHORS   Ohyanagi,H., Tanaka,T., Sakai,H., Shigemoto,Y., Yamaguchi,K.,
            Habara,T., Fujii,Y., Antonio,B.A., Nagamura,Y., Imanishi,T.,
            Ikeo,K., Itoh,T., Gojobori,T. and Sasaki,T.
  TITLE     The Rice Annotation Project Database (RAP-DB): hub for Oryza
            sativa ssp.
  JOURNAL   Nucleic Acids Res. 34 (DATABASE ISSUE), D741-D744 (2006)
FEATURES             Location/Qualifiers
     source          1..3602
                     /organism="Oryza sativa Japonica Group"
                     /mol_type="genomic DNA"
                     /cultivar="Nipponbare"
                     /db_xref="taxon:39947"
                     /chromosome="4"
     gene            1..3602
                     /gene="Os04g0486600"
                     /db_xref="GeneID:4336216"
     mRNA            join(1..86,216..239,333..433,977..1092,1179..1278,
                     1378..1524,1639..1699,1790..1887,2428..2570,2684..2767,
                     2848..2937,3328..3602)
                     /db_xref="GeneID:4336216"
                     /note="Glyceraldehyde-3-phosphate dehydrogenase, cytosolic

     1141 agtataccat atcgctcaac cttaatttct ctgctcagga accctgagga gatcccatgg
     1201 ggtgagactg gcgctgagtt tgttgtggag tccactggtg ttttcactga caaggacaag
     1261 gccgctgctc acctgaaggt attttctgcg aaacccaaca tgtcattgtt tgattgtggt
     1321 gtcataacaa ttcatatgaa gaatgattct catggtatca tatgcgtttc aatatagggt
     1381 ggtgctaaga aggtcgtcat ctctgctccc agcaaggatg cccccatgtt tgttgttggt
     1441 gtcaatgaga aggagtacaa gcctgacatc gacattgtgt ccaatgctag ctgcaccacc
     1501 aactgccttg ctccacttgc caaggtgtgt tcctcactca ttttcacttg tctactgttg
     1561 caaattgtgt ggtgatgtta agaaattggt cactactacc atatgaagcc ttgtccttat
     1621 caaattgtct gcctgcaggt tatcaatgac aggtttggta ttgttgaggg tttgatgacc
     1681 actgtccatg caatcactgg tatgatcttt gaagtacctg tgatgttctt cattgttgtg
     1741 gcgttgctta gcatttcttt gatgtaacac tgtgctgtaa ttcttacagc aactcagaag
     1801 accgttgatg gaccctcgag caaggactgg aggggtggaa gggctgccag tttcaacatc
     1861 atccctagca gcactggagc tgcaaaggta atcatgagat ttaatattat cccatatgct
     1921 gaggtcagcc tccatgatga tttctgtctg tacatataat ataggtgcat tagttagtaa
     1981 tgattagtcg ctcacttaag agtcaactct ttgttaactt cactcgagtt ggtaggctca
     2041 acctttgtag catcaacaat gatttgttaa agatctcttt gaatgcatta gcatttggtg
     2101 tacatgtcag acaaaatttt taaacacgtt tacttataat tgaagttaaa gctactgact
     2161 ccctatcaga gaaggttcaa agtacattgt cctgaagcaa ctgtactaat ttgaacacag
     2221 tttctgcctc caaatgcttc tttgtttgca tcatgtttgt taccttcagc ataatctctg
     2281 gaagtgcaac cgtatgcatc taagttgtaa attcaaaatt tgagatcatt atttaagaat
     2341 tttgtttata gtcttttcgc ctttgtgcta atcctgtttc ggggaaatgt ttgtacatgt
     2401 tttaactttt gctataaaac ttgttaggct gtcggcaagg tgcttcctgc tctcaatgga
     2461 aagttgactg gtatggcttt ccgtgttcca accgtggatg tctctgtcgt tgatttgact
     2521 gtcaggcttg agaagccagc atcctatgat cagattaagg cagctatcaa gtaagtgtag
     2581 cataaattag aattgccctt acagaaagtg aagaggctaa agtattgggt accagaacgt
     2641 catctaagac tagcaattga cttttgtttg ttctgttttc tagggaggag tctgagggga
     2701 aactcaaggg tattctgggt tacgttgagg aggaccttgt ttccacagac ttccagggtg
     2761 acaacaggtg gagatctgtt taaattcgtc ttgtatttgg atactcggtt actcgaagcc
     2821 cctgacctct tttgctctgt gtaacaggtc aagcatcttt gatgcaaagg ccggtattgc
     2881 tctcaatgac aactttgtta agcttgtgtc ttggtatgac aacgaatggg gatacaggta
     2941 atcatttcta tgtgcattag aaagctatac ttgtttttta gttcatggtt gtttggttca
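The exon coordinates in the mRNA join(...) feature of the rice record in Appendix F can also be tabulated with a short script. The following is a minimal sketch in Python, not part of the original exercise. The function names parse_join and introns_between are hypothetical helpers introduced here for illustration, and the sketch assumes a simple join of start..end pairs on the direct strand (no complement() or partial-location markers).

```python
import re

# mRNA location copied from the rice GAPDH record in Appendix F.
MRNA_LOCATION = (
    "join(1..86,216..239,333..433,977..1092,1179..1278,1378..1524,"
    "1639..1699,1790..1887,2428..2570,2684..2767,2848..2937,3328..3602)"
)


def parse_join(location):
    """Return (start, end) exon pairs from a simple GenBank join(...) string.

    Assumption: every segment has the form start..end on the direct strand.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def introns_between(exons):
    """Return the (start, end) gaps between consecutive exons (1-based, inclusive)."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


exons = parse_join(MRNA_LOCATION)
introns = introns_between(exons)

# Print one line per exon and intron, with lengths in base pairs.
for number, (start, end) in enumerate(exons, start=1):
    print(f"exon {number:2d}: {start:4d}..{end:4d}  ({end - start + 1} bp)")
for number, (start, end) in enumerate(introns, start=1):
    print(f"intron {number:2d}: {start:4d}..{end:4d}  ({end - start + 1} bp)")

# Total length of the spliced transcript implied by the annotation.
spliced_length = sum(end - start + 1 for start, end in exons)
print("spliced mRNA length:", spliced_length, "bp")
```

Students could replace MRNA_LOCATION with the join(...) string from the GenBank accession that BLAST reports as most similar to their own contig, then compare the resulting exon and intron sizes with the splice sites predicted for their sequence.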