SAGE - Humboldt State University

Mark S. Wilson
Humboldt State University
Mining Genomes Bioinformatics Tutorial: Gene Expression
Introduction
Genomics studies have indicated that there are about 30,000 different genes in the human
genome, and, because of alternate RNA splicing possibilities, these can make about
100,000 different proteins. Not all of these proteins are made in any particular tissue, and
the amount of a protein made in one tissue varies with different developmental and
physical states. For example, consider the changes that occur during the course of
pregnancy, in both the mother and the developing child. The regulation and coordinate
expression of genes occurs with an exact and complex timing and progression.
In this chapter, you studied the regulation of the lac operon in Escherichia coli. Many
groups of researchers worked for years to elucidate the details of how this operon is
regulated. Today, new technologies are being developed that allow researchers to easily
and rapidly study the coordinate expression of thousands of genes simultaneously. In this
tutorial, you will be introduced to one of these technologies, named SAGE (serial
analysis of gene expression).
Of all the new '-omics' (genomics, proteomics, lipomics), the most exciting and the most
promising is the field of transcriptomics—the study of gene expression. Some of the
very interesting questions that can be asked with these techniques include basic questions
about how our bodies work as well as specific questions concerning the causes and
development of diseases. Some of these questions are:
- how many genes are expressed in different tissue types?
- how many genes are expressed at a high level in different tissue types?
- what do we mean by a high level of expression, anyway: 10 transcripts/cell? 100
transcripts/cell? 10,000?
- what fraction of the total genes are expressed in a given cell type?
- how does gene expression change as a cell progresses through the cell cycle?
- how does gene expression differ in a disease state? For example, how does gene
expression in normal brain cells differ from gene expression in a brain tumor?
Many other types of questions can be asked as well. For example, a large fraction of the
total genes in the human genome do not have an assigned function; by understanding how
a gene is regulated and comparing its expression to known genes, it may be possible to
get insights into its function. And if the transcriptome of normal and diseased states is
known, it may be possible to individually tailor a treatment program for a patient by
monitoring how gene expression changes during the course of treatment, and responding
accordingly.
Objectives
- to introduce the use of SAGE (serial analysis of gene expression)
- to introduce the tools available at NCBI for examining gene expression, including Gene
Expression Omnibus, SAGEmap, and Unigene.
Project
We are going to look at SAGE (serial analysis of gene expression), a powerful means of
studying gene expression. This is a method that can be used to quantitatively study the
expression of thousands of genes simultaneously. During this portion of the project, you
can design hypotheses and ask your own questions about patterns of gene expression.
Serial Analysis of Gene Expression
Microarrays are one approach that can be used to study gene expression (details are
available in your text). SAGE is a clever approach that uses a method fundamentally
different from that used in microarrays. While microarrays rely on hybridization, SAGE
is a sequencing-based approach. A pictorial description of this method can be found at
the site
http://www.sagenet.org/findings/index.html
Essentially, molecular methods are used to isolate a single, specific 10-14 base-pair
fragment of every mRNA that is expressed in a cell. The unique fragment of each mRNA
is called the SAGEtag of that transcript. These fragments are then linked together into
long strands such that each of the fragments is separated by a short 'punctuation'
sequence. The nucleotide sequence of one of these strands is determined using
high-throughput sequencers. If a gene is turned on, then you will be able to find the unique
SAGEtag of its 10-14 bp sequence -- and the number of times that you find that SAGEtag
is a direct measure of the abundance of that transcript! This method is slower than
microarray hybridization, but as sequencing methods improve it will become more
manageable as a research tool.
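To make the counting idea concrete, here is a minimal Python sketch. It is not the actual SAGE analysis software. It assumes that every tag is exactly 10 bases long and that the 'punctuation' sequence is CATG; both are simplifying assumptions made for illustration (real SAGE protocols join tags in pairs before linking them, a step omitted here). The function name count_tags and the example strand are hypothetical.

```python
from collections import Counter

PUNCTUATION = "CATG"  # assumed separator sequence (illustrative choice)
TAG_LENGTH = 10       # assumed fixed tag length


def count_tags(strand: str) -> Counter:
    """Split a sequenced strand at the punctuation and count each tag."""
    pieces = strand.upper().split(PUNCTUATION)
    # Keep only pieces of the expected tag length; discard empty or partial pieces.
    tags = [piece for piece in pieces if len(piece) == TAG_LENGTH]
    return Counter(tags)


# Hypothetical strand built from two tags that appear in this tutorial.
strand = "CATG" + "CATG".join(["GCAAGAAAGT", "AGTGTGTGGA", "GCAAGAAAGT"]) + "CATG"
counts = count_tags(strand)

for tag, n in counts.most_common():
    print(tag, n)
```

The number printed next to each tag plays the role of the abundance measure described above: a tag seen twice comes from a transcript that was sampled twice.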
Go to the National Center for Biotechnology Information (NCBI) site, where you will be
able to analyze SAGE data.
http://www.ncbi.nlm.nih.gov/
Gene Expression Omnibus - GEO
On the right-hand side of the page is a set of links under the heading Hot Spots. Select the
link for Gene Expression Omnibus.
GEO is a site dedicated to providing public access to gene expression data. In addition to
serving as a database and repository of different sets of gene expression data, it provides
a set of tools that can be used to query the database, or to analyze gene expression data
that you have generated. The design of these tools is a challenging undertaking because
the data comes from many different labs in many different formats, and is generated
using a variety of approaches. Like many of the new fields in bioinformatics, it is likely
that the development of flexible intuitive tools will rapidly increase knowledge and
understanding in ways that the tool developers frequently can't predict.
SAGEmap
There is a bar on the right of the page that has links to a variety of tools available at the
GEO site. Select SAGEmap.
This site is a public repository of SAGE data. Quantitative, whole-genome gene
expression profiles, called libraries, are stored here for researchers to analyze. These
libraries are primarily from human tissues, and include specific tissues such as brain,
colon, kidney, and pancreas. They also include similar tissues that were collected from
different sources, such as males and females, diseased and normal tissue, or individuals
of different ages. In some cases, gene expression profiles are from pooled tissue samples
(for example, cancerous ovarian tissues from young women) in order to reduce effects
from individual variations. Because the data is submitted while the projects are ongoing,
some of the libraries are more complete than others.
Using SAGE resources at NCBI - xProfiler
We are going to start by using a program called xProfiler. This program allows you to
pick two different tissue types and compare total gene expression in those tissues. For
example, you could compare gene expression in normal brain tissue and in cancerous
brain tissue. Or, you might choose to compare gene expression in brain tissue from a
developing fetal female and a young adult female.
Along the right side of the page, select the option labeled Analyze...by library. This will
take you to the xProfiler site.
The samples that you compare are simply referred to as Groups A and B. There are a
variety of options in choosing the samples for comparison. You could select samples
from the list on this page. Alternatively, you could choose the libraries from a full list of
the available libraries, by using the link to the Library Browser that is located on the top
right of the page.
We are going to select two different libraries that should have some clear-cut differences
in gene expression: leukocytes (white blood cells) and epithelial tissue. At the top of the
page, enter white blood cell in the box for 'Group A name' and epithelial in the box for
'Group B name'.
In Column A, under the heading Homo sapiens, check the box for
SAGE Duke leukocyte (48523 tags)
Bulk tissue, blood, normal human adult leukocyte total RNA
Note that the libraries are listed in alphabetical order.
The (48523 tags) portion of the entry refers to the total number of SAGEtags that have
been sequenced, which is also the total number of individual transcripts that have been
sampled. To get 48,000 tags probably required about 1,500 different sequencing
reactions. Note that the number of tags may have changed, if more sequences have been
obtained for this library, but the rest of the library name should be the same.
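As a quick check on the estimate above, the arithmetic can be sketched in Python. The figure of 1,500 reactions is only the rough estimate given in the text, so the result is approximate.

```python
total_tags = 48523            # tag count listed for the SAGE Duke leukocyte library
sequencing_reactions = 1500   # rough estimate quoted in the text

# Average number of tags read from each sequencing reaction.
tags_per_reaction = total_tags / sequencing_reactions
print(f"about {tags_per_reaction:.0f} tags per sequencing reaction")
```

In other words, each sequenced strand would need to carry on the order of a few dozen tags for the estimate to hold.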
Next, select two libraries for your B sample by checking the boxes in the B column for
SAGE NC1 (50179 tags)
Bulk tissue, normal colonic epithelium
SAGE NC2 (49593 tags)
Bulk tissue, normal colonic epithelium
Now, go to the top of the page. The xProfiler program will compare the A library with
the B libraries, and return to you a list of SAGEtags that are present at different
frequencies in the different libraries. The default setting for the difference in frequencies
is a factor of two. For our purposes, we want to change this value to 10.
After changing the default to 10, press Calculate (at the bottom of the page).
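The comparison described above can be sketched roughly as follows. This is not xProfiler's actual algorithm, which is not described in this tutorial. It is an illustrative sketch that pools the libraries in each group, converts counts to frequencies, and keeps tags whose frequencies differ by at least the chosen factor. The function names, the pseudocount of 1, and the toy tag counts are all assumptions made for the example.

```python
from collections import Counter


def pool(libraries):
    """Add up the tag counts from every library in a group."""
    total = Counter()
    for library in libraries:
        total.update(library)
    return total


def differential_tags(group_a, group_b, factor=10.0):
    """Return (tag, count_a, count_b) for tags differing by at least `factor`."""
    a, b = pool(group_a), pool(group_b)
    size_a, size_b = sum(a.values()), sum(b.values())
    hits = []
    for tag in sorted(set(a) | set(b)):
        # A pseudocount of 1 (an assumption) avoids dividing by zero
        # when a tag is missing from one group.
        freq_a = (a[tag] + 1) / (size_a + 1)
        freq_b = (b[tag] + 1) / (size_b + 1)
        ratio = freq_a / freq_b
        if ratio >= factor or ratio <= 1 / factor:
            hits.append((tag, a[tag], b[tag]))
    return hits


# Toy data: the counts are invented and far smaller than real libraries.
leukocyte = Counter({"GCAAGAAAGT": 120, "AAAAAAAAAA": 50, "CCCCCCCCCC": 30})
colon_1 = Counter({"GCAAGAAAGT": 1, "AAAAAAAAAA": 45, "TTTTTTTTTT": 54})
colon_2 = Counter({"AAAAAAAAAA": 55, "TTTTTTTTTT": 45})

hits = differential_tags([leukocyte], [colon_1, colon_2], factor=10.0)
for tag, count_a, count_b in hits:
    print(tag, count_a, count_b)
```

Raising the factor from 2 to 10, as the tutorial does, shortens the list to the tags with the most pronounced differences between the two groups.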
The xProfiler output will load in the same window. If your query gets queued, then you
may need to select the button labeled GO. This may take up to a few minutes, depending
on several factors such as how much activity the site is getting.
The SAGEtags are presented as a table with six columns. The first column is the
nucleotide sequence of the tag. This is followed by three columns of numbers. The first
column of numbers is the number of times that SAGEtag was found in library A while
the second column of numbers is the number of times that SAGEtag was found in library
B. The next column is a statistical value to help evaluate whether or not the observed
difference in expression is due to sampling error, and the fifth column is the Unigene
cluster (more on that below) that matches the SAGEtag.
The final column in the xProfiler output gives a description of the gene that matches the
particular SAGEtag in that row. If the gene has been well characterized, then these
descriptions are very specific; however, the descriptions are somewhat vague for genes
that haven't been studied very thoroughly. Many of these genes will of course be new to
you, but some of them will be familiar. Scroll down the list slowly, and look for proteins
that you know. While you are doing this, also try to think about whether or not that
pattern of gene expression makes sense, given what you know about leukocytes (white
blood cells) and colons.
For example, you should see that keratin and cytokeratin are expressed more highly in the
colonic tissue, and hemoglobin alpha and hemoglobin beta are expressed more highly in
the blood cells. Recall that keratin is a nonenzymatic protein that has a structural role in
epithelial cells (and makes up the bulk of animal hair); a higher rate of expression would
be expected in the colonic tissue. Similarly, hemoglobin is an oxygen-carrying protein
present in blood that gives red blood cells their characteristic color. The higher rate of
expression in this case is not wholly unexpected, but does raise some questions. Is
hemoglobin made in white blood cells? Or, was the white blood cell sample from which
this library was derived contaminated with red blood cells?
Look at the last column and find a protein that you have heard of, near the top of the list.
For our purposes, find 'hemoglobin, beta'. Select the tag sequence of that tag by clicking
on the words 'More Unigenes...' below the tag sequence.
Using SAGE resources at NCBI - SAGE Tag to Gene Mapping
This takes you to a site called the SAGE Tag to Gene Mapping page. This site gives you
details on all of the different SAGE libraries that this tag sequence has been found in,
including the library name and information on the frequency with which that tag is
represented in the library. The simplest way to quickly assess this information is to look
at the shaded ovals in the third column. The darker ovals represent high rates of
expression, the light ovals represent low levels of expression, while the gray-shaded ovals
represent intermediate levels of expression. Because the different libraries are at
different levels of completeness, this data is normalized to the expected number of tags
that would be found in a library containing a million total tags. The shading is meant to
simulate the type of results you might see if you were to carry out a quantitative
hybridization analysis, comparing the expression of that gene in the different libraries.
The 'Tag counts' and 'Total tags' columns describe how many times the tag has been
found in that library, and how many total tags have been found in the library. As of this
writing, the only SAGE library that this tag has been found in more than once is the
SAGE Duke leukocyte library.
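The normalization to a library of one million tags mentioned above is a simple proportion. A minimal Python sketch follows. The function name and the tag count of 25 are hypothetical, chosen only for illustration, while 48523 is the library size quoted earlier for the SAGE Duke leukocyte library.

```python
def tags_per_million(tag_count: int, total_tags: int) -> float:
    """Scale a raw tag count to a library of one million total tags."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return tag_count / total_tags * 1_000_000


# Hypothetical example: a tag seen 25 times in a library of 48523 tags.
tpm = tags_per_million(25, 48523)
print(f"{tpm:.0f} tags per million")
```

Scaling every library to the same size is what allows a small, incomplete library to be compared with a larger one.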
At the top of the page is the UniGene identification number for this gene, in this case
Hs.155376. UniGene is a system for partitioning GenBank sequences into a non-redundant
set of gene and gene-like entries. Whereas a particular gene may have been sequenced and
entered into GenBank a number of different times by different researchers, that gene will
have only one entry in UniGene for a given organism.
This Unigene entry will list the product and function of the gene, if known. The Unigene
entry also has information such as the tissue types in which the gene has been expressed,
and the map location of the gene.
The xProfiler program allows you to compare the expression of all the genes in different
tissue samples. As the number of SAGE libraries increases, you can also ask questions
about individual genes, for example, concerning which tissues a gene is highly expressed
in and which tissues a gene is not expressed in. One of the ways that this can be done is
using the SAGE Tag to Gene Mapping site, which you are at now. A more
comprehensive look at individual genes can be obtained at the Unigene site for a gene.
Unigene
Select the UniGene link on the tool bar at the top of the page. This takes you to a page
that has a description of UniGene, as well as a search function. Look over the descriptive
material. Then, in the search box enter hemoglobin. This search will give you several
hits. Select the UniGene entry for hemoglobin beta (symbol = HBB; Entry # Hs.155376).
This takes you to the UniGene entry for this gene.
At the top of the entry are links to other NCBI sites that have information on this gene,
such as LocusLink and OMIM (Online Mendelian Inheritance in Man). Below this are
similar proteins that are found in other organisms, such as mice (Mus musculus), and
information about how similar the proteins are. For example, the beta hemoglobin of the
mouse is 81% identical to the human beta hemoglobin. Below this, in the Mapping
Information section, the location of the gene in the genome is identified and linked.
Below this is a summary of gene expression information, including SAGE data.
The Cancer Genome Anatomy Project
Many of the SAGE projects have been carried out as a part of the Cancer Genome
Anatomy Project (CGAP), a large effort to determine the differences in gene expression
between normal and cancerous tissues and cells. We will finish up this project by using a
few of the CGAP tools. To get to the CGAP home page, select the NCBI icon at the top
of this page, then select the link to The Cancer Genome Anatomy Project at the top
right of the NCBI home page. This page describes some of the different ways that NCBI
is helping to make CGAP data available to the public. From this page, select the link to
The Cancer Genome Anatomy Project in the first sentence. The URL for the CGAP
home page is:
http://cgap.nci.nih.gov/
There are several interesting links available from this page. Select the link to SAGE
Genie.
SAGE Genie
The SAGE Genie site lets you query the SAGE libraries in ways that xProfiler does
not. For example, you can input a specific gene, and then ask how that gene is
expressed in different tissues. The output shows expression levels in both normal and
cancerous tissues. Let's do this.
Select The SAGE anatomic viewer from the middle of the page.
In the box next to "Tag (sequence of 10 bases)", enter the tag for beta hemoglobin,
GCAAGAAAGT. Then click Go.
At the top of the page, select the drawing of the human figure, under the words 'SAGE
Anatomic Viewer?' This will return the data for tissue samples.
The output portrays the different tissue systems of the human body. The color associated
with each tissue is keyed to the relative level of expression in that tissue. Again, you can
see that the mRNA for beta hemoglobin was found in many different libraries, but its
expression was the highest in the leukocyte library. You can also see that not only is this
gene expressed in different tissues, its expression is different in some cancerous tissues
than in the corresponding normal tissues.
Selecting the link to the Brain displays the relative expression in different regions of the
brain, as well as in different types of brain cancers. As you can see, differential
expression can be seen here as well.
Now you have been introduced to the world of transcriptomics, the study of simultaneous
gene expression in whole genomes. We have focused on the use of SAGE, a
sequencing-based approach that quantifies small, ~10 nucleotide tags of different mRNAs. CGAP
has many other tools that utilize different approaches than SAGE, which you could
explore from the CGAP home page at
http://cgap.nci.nih.gov/
Also, Johns Hopkins University has a site dedicated to SAGE that you might want to
investigate, at:
http://www.sagenet.org/
Questions - Review
1. What, specifically, is meant by the term 'gene expression'? Can you have 'gene
expression' of a particular gene, but not make the encoded protein? Explain.
2. Why is it necessary to regulate gene expression in a cell?
3. Define the word 'transcriptomics'.
Questions - Thought and Application
1. You are carrying out the analysis of a particular tissue type, and you come across the
following SAGEtag as one that is highly expressed in your tissue.
AGTGTGTGGA
a. What protein does this SAGEtag correspond to?
b. What kind of tissue are you most likely studying?
2. Use UniGene to investigate the gene expression of two proteins that you are familiar
with. These should be proteins whose expression you should be able to predict
something about. For example, you might pick insulin, a hormone secreted by the
pancreas that is important in regulating blood glucose levels. For the proteins you choose,
answer the following questions:
a. What is the name of the protein?
b. Briefly, make predictions about the expression patterns that you expect to find for this
protein.
c. What is the UniGene symbol for your protein? What is the UniGene entry number?
d. Are homologs found in other animals? Which animals, and what is the percent
identity?
e. Based on the Tag to Gene Mapping data, is the gene highly expressed in a particular
library? Which one(s)? How many of the tags were found in those libraries, and how
many total tags are in the library?
3. Use the xProfiler to design an experiment of your choosing, in which you compare
gene expression in two different tissue types. For your experiment, describe the
following:
a. What libraries did you compare?
b. What were some of the genes that were transcribed differently in the different
libraries? Did you recognize any of these genes? Based on what you know about the tissues,
do these patterns of expression make sense?