Mark S. Wilson
Humboldt State University

Mining Genomes
Bioinformatics Tutorial: Gene Expression

Introduction

Genomics studies have indicated that there are about 30,000 different genes in the human genome, and, because of alternate RNA splicing possibilities, these can make about 100,000 different proteins. Not all of these proteins are made in any particular tissue, and the amount of a protein made in one tissue varies with different developmental and physical states. For example, consider the changes that occur during the course of pregnancy, in both the mother and the developing child. The regulation and coordinate expression of genes occurs with an exact and complex timing and progression.

In this chapter, you studied the regulation of the lac operon in Escherichia coli. Many groups of researchers worked for years to elucidate the details of how this operon is regulated. Today, new technologies are being developed that allow researchers to easily and rapidly study the coordinate expression of thousands of genes simultaneously. In this tutorial, you will be introduced to one of these technologies, named SAGE (serial analysis of gene expression).

Of all the new '-omics' (genomics, proteomics, lipomics), the most exciting and the most promising is the field of transcriptomics, the study of gene expression. Some of the very interesting questions that can be asked with these techniques include basic questions about how our bodies work as well as specific questions concerning the causes and development of diseases. Some of these questions are:

- how many genes are expressed in different tissue types?
- how many genes are expressed at a high level in different tissue types?
- what do we mean by a high level of expression, anyway: 10 transcripts/cell? 100 transcripts/cell? 10,000?
- what fraction of the total genes are expressed in a given cell type?
- how does gene expression change as a cell progresses through the cell cycle?
- how does gene expression differ in a disease state? For example, how does gene expression in normal brain cells differ from gene expression in a brain tumor?

Many other types of questions can be asked as well. For example, a large fraction of the total genes in the human genome do not have an assigned function; by understanding how a gene is regulated and comparing its expression to that of known genes, it may be possible to get insights into its function. And if the transcriptomes of normal and diseased states are known, it may be possible to individually tailor a treatment program for a patient by monitoring how gene expression changes during the course of treatment, and responding accordingly.

Objectives

- to introduce the use of SAGE (serial analysis of gene expression)
- to introduce the tools available at NCBI for examining gene expression, including Gene Expression Omnibus, SAGEmap, and UniGene.

Project

We are going to look at SAGE (serial analysis of gene expression), a powerful means of studying gene expression. This is a method that can be used to quantitatively study the expression of thousands of genes simultaneously. During this portion of the project, you can design hypotheses and ask your own questions about patterns of gene expression.

Serial Analysis of Gene Expression

Microarrays are one approach that can be used to study gene expression (details are available in your text). SAGE is a clever approach that uses a method fundamentally different from that used in microarrays. While microarrays rely on hybridization, SAGE is a sequencing-based approach. A pictorial description of this method can be found at the site http://www.sagenet.org/findings/index.html

Essentially, molecular methods are used to isolate a single, specific 10-14 base-pair fragment of every mRNA that is expressed in a cell. The unique fragment of each mRNA is called the SAGEtag of that transcript.
These fragments are then linked together into long strands such that each of the fragments is separated by a short 'punctuation' sequence. The nucleotide sequence of these strands is determined using high-throughput sequencers. If a gene is turned on, then you will be able to find its unique 10-14 bp SAGEtag in the sequence data, and the number of times that you find that SAGEtag is a direct measure of the abundance of that transcript! This method is slower than microarray hybridization, but as sequencing methods improve it will become more manageable as a research tool.

Go to the National Center for Biotechnology Information (NCBI) site, where you will be able to analyze SAGE data: http://www.ncbi.nlm.nih.gov/

Gene Expression Omnibus - GEO

On the right-hand side of the page is a set of links under the heading Hot Spots. Select the link for Gene Expression Omnibus. GEO is a site dedicated to providing public access to gene expression data. In addition to serving as a database and repository of different sets of gene expression data, it provides a set of tools that can be used to query the database, or to analyze gene expression data that you have generated. The design of these tools is a challenging undertaking because the data come from many different labs in many different formats, and are generated using a variety of approaches. As in many of the new fields in bioinformatics, it is likely that the development of flexible, intuitive tools will rapidly increase knowledge and understanding in ways that the tool developers frequently can't predict.

SAGEmap

There is a bar on the right of the page that has links to a variety of tools available at the GEO site. Select SAGEmap. This site is a public repository of SAGE data. Quantitative, whole-genome gene expression profiles, called libraries, are stored here for researchers to analyze. These libraries are primarily from human tissues, and include specific tissues such as brain, colon, kidney, and pancreas.
They also include similar tissues that were collected from different sources, such as males and females, diseased and normal tissue, or individuals of different ages. In some cases, gene expression profiles are from pooled tissue samples (for example, cancerous ovarian tissues from young women) in order to reduce effects from individual variations. Because the data are submitted while the projects are ongoing, some of the libraries are more complete than others.

Using SAGE resources at NCBI - xProfiler

We are going to start by using a program called xProfiler. This program allows you to pick two different tissue types and compare total gene expression in those tissues. For example, you could compare gene expression in normal brain tissue and in cancerous brain tissue. Or, you might choose to compare gene expression in brain tissue from a developing fetal female and a young adult female.

Along the right side of the page, select the link labeled Analyze...by library. This will take you to the xProfiler site. The samples that you compare are simply referred to as Groups A and B. There are a variety of options in choosing the samples for comparison. You could select samples from the list on this page. Alternatively, you could choose the libraries from a full list of the available libraries, by using the link to the Library Browser that is located on the top right of the page.

We are going to select two different sets of libraries that should have some clear-cut differences in gene expression: leukocytes (white blood cells) and epithelial tissue. At the top of the page, enter white blood cell in the box for 'Group A name' and epithelial in the box for 'Group B name'. In Column A, under the heading Homo sapiens, check the box for

SAGE Duke leukocyte (48523 tags) Bulk tissue, blood, normal human adult leukocyte total RNA

Note that the libraries are listed in alphabetical order.
The (48523 tags) portion of the entry refers to the total number of SAGEtags that have been sequenced, which is also the total number of transcripts that have been analyzed. To get 48,000 tags probably required about 1500 different sequencing reactions. Note that the number of tags may have changed, if more sequences have been obtained for this library, but the rest of the library name should be the same.

Next, select two libraries for your B sample by checking the boxes in the B column for

SAGE NC1 (50179 tags) Bulk tissue, normal colonic epithelium
SAGE NC2 (49593 tags) Bulk tissue, normal colonic epithelium

Now, go to the top of the page. The xProfiler program will compare the A library with the B libraries, and return to you a list of SAGEtags that are present at different frequencies in the different libraries. The default setting for the difference in frequencies is a factor of two. For our purposes, we want to change this value to 10. After changing the default to 10, press Calculate (at the bottom of the page). The xProfiler output will load in the same window. If your query gets queued, then you may need to select the button labeled GO. This may take up to a few minutes, depending on several factors such as how much activity the site is getting.

The SAGEtags are presented as a table with six columns. The first column is the nucleotide sequence of the tag. This is followed by three columns of numbers. The first of these is the number of times that SAGEtag was found in library A, and the second is the number of times that SAGEtag was found in library B. The third is a statistical value to help evaluate whether or not the observed difference in expression is due to sampling error. The fifth column of the table is the UniGene cluster (more on that below) that matches the SAGEtag. The final column in the xProfiler output gives a description of the gene that matches the particular SAGEtag in that row.
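The kind of comparison described above can be sketched in a few lines of Python. This is only an illustration, not xProfiler's actual algorithm, which this tutorial does not describe. It assumes that each tag count is divided by the total number of tags in its group, and that a tag is reported when the two frequencies differ by at least the chosen factor. The function name, the pseudocount (added so that a tag absent from one group does not cause a division by zero), and all tag counts in the example are hypothetical; only the library totals and the beta hemoglobin tag sequence come from this tutorial.

```python
def differential_tags(counts_a, counts_b, total_a, total_b,
                      factor=10.0, pseudocount=0.5):
    """Return (tag, count_a, count_b, ratio) for tags whose relative
    frequency differs by at least `factor` between groups A and B.

    Assumption: frequencies are simple count/total ratios, with a small
    pseudocount added to every count. xProfiler may work differently.
    """
    results = []
    for tag in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(tag, 0)
        b = counts_b.get(tag, 0)
        freq_a = (a + pseudocount) / total_a
        freq_b = (b + pseudocount) / total_b
        ratio = freq_a / freq_b
        # Keep tags that are enriched in either direction.
        if ratio >= factor or ratio <= 1.0 / factor:
            results.append((tag, a, b, ratio))
    return results


# Hypothetical counts, for illustration only.
group_a = {"GCAAGAAAGT": 120, "AAAAAAAAAA": 30}   # leukocyte library
group_b = {"GCAAGAAAGT": 2, "AAAAAAAAAA": 55}     # NC1 + NC2 pooled
total_a = 48523
total_b = 50179 + 49593

for tag, a, b, ratio in differential_tags(group_a, group_b, total_a, total_b):
    print(tag, a, b, round(ratio, 1))
```

With these made-up numbers only the first tag passes the factor-of-10 threshold; the second tag is present at similar frequencies in both groups and is filtered out.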
If the gene has been well characterized, then the description in the final column is very specific; the descriptions are somewhat vague, however, for genes that haven't been studied very thoroughly. Many of these genes will of course be new to you, but some of them will be familiar. Scroll down the list slowly, and look for proteins that you know. While you are doing this, also try to think about whether or not that pattern of gene expression makes sense, given what you know about leukocytes (white blood cells) and colons.

For example, you should see that keratin and cytokeratin are expressed more highly in the colonic tissue, and hemoglobin alpha and hemoglobin beta are expressed more highly in the blood cells. Recall that keratin is a nonenzymatic protein that has a structural role in epithelial cells (and makes up the bulk of animal hair); a higher rate of expression would be expected in the colonic tissue. Similarly, hemoglobin is an oxygen-carrying protein present in blood that gives red blood cells their characteristic color. The higher rate of expression in this case is not wholly unexpected, but does raise some questions. Is hemoglobin made in white blood cells? Or was the white blood cell sample from which this library was derived contaminated with red blood cells?

Look at the last column and find a protein that you have heard of, near the top of the list. For our purposes, find 'hemoglobin, beta'. Select that tag by clicking on the words 'More Unigenes...' below its tag sequence.

Using SAGE resources at NCBI - SAGE Tag to Gene Mapping

This takes you to a site called the SAGE Tag to Gene Mapping page. This site gives you details on all of the different SAGE libraries in which this tag sequence has been found, including the library name and information on the frequency with which that tag is represented in the library. The simplest way to quickly assess this information is to look at the shaded ovals in the third column.
The darker ovals represent high levels of expression, the light ovals represent low levels of expression, and the gray-shaded ovals represent intermediate levels of expression. Because the different libraries are at different levels of completeness, these data are normalized to the number of tags that would be expected in a library containing a million total tags. The shading is meant to simulate the type of results you might see if you were to carry out a quantitative hybridization analysis, comparing the expression of that gene in the different libraries. The 'Tag counts' and 'Total tags' columns describe how many times the tag has been found in that library, and how many total tags have been found in the library. As of this writing, the only SAGE library in which this tag has been found more than once is the SAGE Duke leukocyte library.

At the top of the page is the UniGene identification number for this gene, in this case Hs.155376. UniGene is a system for partitioning GenBank sequences into a nonredundant set of gene and gene-like entries. Whereas a particular gene may have been sequenced and entered into GenBank a number of different times, from different organisms and by different researchers, that gene will have only one entry in UniGene. This UniGene entry will list the product and function of the gene, if known. The UniGene entry also has information such as the tissue types in which the gene has been expressed, and the map location of the gene.

The xProfiler program allows you to compare the expression of all the genes in different tissue samples. As the number of SAGE libraries increases, you can also ask questions about individual genes, for example, concerning which tissues a gene is highly expressed in and which tissues a gene is not expressed in. One of the ways that this can be done is by using the SAGE Tag to Gene Mapping site, which you are at now. A more comprehensive look at individual genes can be obtained at the UniGene site for a gene.
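The per-million normalization used on the SAGE Tag to Gene Mapping page can be written out as a short calculation. The sketch below assumes the normalization is a simple proportional scaling of 'Tag counts' by 'Total tags'; the function name is hypothetical, and the tag count of 100 in the example is made up (only the library total of 48523 tags comes from this tutorial).

```python
def tags_per_million(tag_count, total_tags):
    """Scale a raw tag count to the count expected in a library of
    one million total tags (assumed simple proportional scaling)."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return tag_count * 1_000_000 / total_tags


# Hypothetical example: a tag seen 100 times in a 48523-tag library.
print(round(tags_per_million(100, 48523)))
```

In this example the tag would be expected roughly 2,061 times in a million-tag library. Scaling every library this way is what makes a small, incomplete library comparable to a large one.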
UniGene

Select the UniGene link on the tool bar at the top of the page. This takes you to a page that has a description of UniGene, as well as a search function. Look over the descriptive material. Then, in the search box, enter hemoglobin. This search will give you several hits. Select the UniGene entry for hemoglobin beta (symbol = HBB; Entry # Hs.155376). This takes you to the UniGene entry for this gene.

At the top of the entry are links to other NCBI sites that have information on this gene, such as LocusLink and OMIM (Online Mendelian Inheritance in Man). Below this are similar proteins that are found in other organisms, such as mice (Mus musculus), and information about how similar the proteins are. For example, the beta hemoglobin of the mouse is 81% identical to the human beta hemoglobin. Below this, in the Mapping Information section, the location of the gene in the genome is identified and linked. Below this is a summary of gene expression information, including SAGE data.

The Cancer Genome Anatomy Project

Many of the SAGE projects have been carried out as a part of the Cancer Genome Anatomy Project (CGAP), a large effort to determine the differences in gene expression between normal and cancerous tissues and cells. We will finish up this project by using a few of the CGAP tools. To get to the CGAP home page, select the NCBI icon at the top of this page, then select the link to The Cancer Genome Anatomy Project at the top right of the NCBI home page. This page describes some of the different ways that NCBI is helping to make CGAP data available to the public. From this page, select the link to The Cancer Genome Anatomy Project in the first sentence. The URL for the CGAP home page is: http://cgap.nci.nih.gov/

There are several interesting links available from this page. Select the link to SAGE Genie.

SAGE Genie

The SAGE Genie site lets you query the SAGE libraries in some ways that xProfiler does not.
For example, you can input a specific gene, and then ask how that gene is expressed in different tissues. The output shows expression levels in both normal and cancerous tissues. Let's do this. Select The SAGE anatomic viewer from the middle of the page. In the box next to "Tag (sequence of 10 bases)", enter the tag for beta hemoglobin, GCAAGAAAGT. Then click Go. At the top of the page, select the drawing of the human figure, under the words 'SAGE Anatomic Viewer?' This will return the data for tissue samples.

The output portrays the different tissue systems of the human body. The color associated with each tissue is keyed to the relative level of expression in that tissue. Again, you can see that the mRNA for beta hemoglobin was found in many different libraries, but its expression was the highest in the leukocyte library. You can also see that not only is this gene expressed in different tissues, its expression is different in some cancerous tissues than in the corresponding normal tissues. Selecting the link to the Brain displays the relative expression in different regions of the brain, as well as in different types of brain cancers. As you can see, differential expression can be seen here as well.

Now you have been introduced to the world of transcriptomics, the study of simultaneous gene expression in whole genomes. We have focused on the use of SAGE, a sequencing-based approach that quantifies small, ~10 nucleotide tags of different mRNAs. CGAP has many other tools that use approaches other than SAGE, which you could explore from the CGAP home page at http://cgap.nci.nih.gov/ Also, Johns Hopkins University has a site dedicated to SAGE that you might want to investigate, at: http://www.sagenet.org/

Questions - Review

1. What, specifically, is meant by the term 'gene expression'? Can you have 'gene expression' of a particular gene, but not make the encoded protein? Explain.
2. Why is it necessary to regulate gene expression in a cell?
3. Define the word 'transcriptomics'.

Questions - Thought and Application

1. You are carrying out the analysis of a particular tissue type, and you come across the following SAGEtag as one that is highly expressed in your tissue: AGTGTGTGGA
   a. What protein does this SAGEtag correspond to?
   b. What kind of tissue are you most likely studying?
2. Use UniGene to investigate the gene expression of two proteins that you are familiar with. These should be proteins whose expression you should be able to predict something about. For example, you might pick insulin, a hormone secreted by the pancreas that is important in regulating blood glucose levels. For the proteins you choose, answer the following questions:
   a. What is the name of the protein?
   b. Briefly, make predictions about the expression patterns that you expect to find for this protein.
   c. What is the UniGene symbol for your protein? What is the UniGene entry number?
   d. Are homologs found in other animals? Which animals, and what is the percent identity?
   e. Based on the Tag to Gene Mapping data, is the gene highly expressed in a particular library? Which one(s)? How many of the tags were found in those libraries, and how many total tags are in the library?
3. Use the xProfiler to design an experiment of your choosing, in which you compare gene expression in two different tissue types. For your experiment, describe the following:
   a. What libraries did you compare?
   b. What were some of the genes that were transcribed differently in the different libraries? Did you recognize any of these genes? Based on what you know about the tissues, do these patterns of expression make sense?