Background and information about how .mgf files were generated

Introductory Overview
Where do the mgf files come from? An overview of key
concepts and techniques in proteomics.
Chris Eurich and Peter Fields
This document is intended to provide a brief introduction to proteomics, the collection
of techniques that allows the detection and identification of many of the proteins
expressed by cells and tissues. The text is modified from laboratory handouts used in an
advanced biochemistry course at Franklin & Marshall College in Lancaster, PA, where
upper-level undergraduates perform a semester-long independent proteomic study of
the effects of environmental stresses on yeast. Mastery of the information presented
here is not necessary to perform the exercises we have designed, but we provide it for
those teachers and students who want to understand more fully how proteomics is
accomplished, and what the steps are that ultimately produce the mgf files you will use
to identify specific proteins. A more complete description of the protocols involved is
available in a separate document, labeled as “advanced” on the website.
We begin with an introductory section defining and describing proteomics, and then list
the steps that students in the yeast proteomics lab take to obtain protein samples for
mass spectrometry analysis. We finish up with a description of the procedures that
students should follow to complete the current activity, in order to identify and explore
the “unknown” proteins found in the mgf files.
I. Proteomics – what is it?
II. The proteomic workflow
III. Using Mascot search software to identify proteins from mgf files
I. Proteomics – what is it?
Perhaps the best way to define proteomics is to draw an analogy with a more familiar
branch of systems biology – genomics. As our ability to sequence DNA has increased,
the complete genomes of many species have become available. Genomic studies make
use of these complete gene sequences to describe the “blueprint” that controls the
structure and function of organisms. Genomic techniques allow researchers to look for
novel genes, or to compare genes in different species to determine how natural
selection has altered sequences.
Proteomics, instead of focusing on genes, attempts to provide a description of
gene products, that is, the entire protein complement in the cell or tissue. This
fundamental difference provides proteomic approaches with unique strengths, as well
as significant weaknesses. Because, with few exceptions, proteins carry out the
functions of a cell, a description of proteins present will provide the most accurate
picture of how a cell grows, metabolizes, and interacts with its environment. In contrast
to genomics, which tells us the proteins an organism potentially could produce,
proteomics describes the proteins that are really there.
However, proteomics, at least in its current form, also has serious drawbacks.
The most important is that the techniques currently in use do not allow us to capture
information about all, or even a majority of proteins in a cell. If we are lucky, we will be
able to detect perhaps 1000 – 1500 proteins, but a yeast cell probably expresses
between 10 and 20 times as many different proteins. Thus, despite the potential of
proteomics, we must keep in mind that our description is far from complete.
Ideally, the information available from genomics and proteomics, as well as
other “omics” such as transcriptomics and metabolomics (the complete description of
all metabolites present in a cell or tissue), can be combined to describe the function of
the system as a whole – leading to “systems biology” as a way to understand the
function of cells and organisms. This currently is beyond the capability of all but a few
labs in the world. Nevertheless, biologists expect that this capability will grow, and will
provide new insights into the function of cells, tissues and organisms that would not
have been possible using older methods.
II. The proteomic workflow
Most proteomics projects aim to achieve two separate but related goals. The
first is to quantify the amount of as many proteins as possible from the samples of
interest (in our case, yeast that have been exposed to a particular stress). This process
is accomplished by two-dimensional gel electrophoresis (2-DGE), followed by
computer-based gel image analysis. The second goal is to identify some of the proteins
we find – in the upper level college biochemistry course that produced the data you will
use, the proteins of greatest interest were those that changed in abundance in response
to particular stresses. Protein identification is accomplished via tandem mass
spectrometry (MS/MS) ions searching, a process that combines digestion of proteins
into peptide fragments, measurement of the size of the fragments via mass
spectrometry (MS), and comparison of fragment sizes with databases of known
fragments in order to uniquely identify the target. Here is the workflow that students in
the lab used to accomplish these goals:
1) Growth of yeast under optimal or stress conditions – the students’ goal was to find
proteins that increase in abundance (are “up-regulated”) in response to some
environmental stress, such as heat or low oxygen, that the students were interested in
studying.
2) Extraction of protein from samples – Students separated proteins from the other
components of the yeast cell, attempting to retain as many of the proteins of the cell as
possible as other macromolecules were discarded.
3) Measurement of protein concentration – Students needed to know the concentration
of protein in their samples to load the appropriate amount on their gels.
4) First-dimension electrophoresis – To maximize the number of proteins we can see, we
must spread our proteins across a gel in two dimensions. The first dimension separates
proteins according to their isoelectric points (pI’s); pI essentially is the pH at which any
given protein has a net neutral charge. The technique by which this is accomplished is
termed isoelectric focusing (IEF).
5) Second-dimension electrophoresis – In the second dimension, proteins are separated
by mass. This is accomplished with the commonly used SDS – polyacrylamide gel
electrophoresis (SDS-PAGE) technique.
6) Staining and gel image capture – Once proteins were separated in two dimensions,
students stained the gels using colloidal Coomassie blue. Images of the gels were made
using a scanner.
7) Gel image analysis – In this step, students measured how many spots there were on
each gel, and determined how much protein was present in each spot. By comparing
gels from “control” yeast to “stress” yeast, students could find the proteins that
changed in amount in response to the stress.
8) In-gel digestion of proteins – Once protein spots of interest were found, students
digested the protein from some of these spots using trypsin, a protease found in the
intestines of most vertebrates (including us). We must do this because the proteins are
too large and complex to provide much information by themselves. By breaking them
into smaller pieces, we can get more information from the mass spectrometer regarding
the structure of the protein itself.
9) Mass spectrometry – We use a relatively advanced instrument, a high-pressure
liquid chromatography tandem mass spectrometer (HPLC-MS/MS), to measure the
masses of the peptides we created using trypsin. The mgf files you will use in this
activity basically are lists of these peptide masses.
10) Bioinformatics / MS/MS ions searching – This final step of the proteomics workflow
is what you will perform in this activity. You will be able to use a search engine
(MASCOT) that takes the peptide fragment masses produced by undergraduate students
who have completed the yeast proteomics lab, and uses them to identify the protein
from which they came.
[Note that if you would like more detailed descriptions of the procedures listed in steps
1-9 above, they can be found in the companion file entitled “Proteomics – background
and concepts (advanced).”]
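The comparison in step 7 can be illustrated with a short Python sketch. It assumes that each gel has already been reduced to a table of spot identifiers and measured spot volumes. The function name, the spot labels, the example numbers, and the two-fold threshold are all hypothetical choices made for illustration; real gel analysis software also normalizes between gels and uses replicate gels and statistical tests.

```python
def upregulated_spots(control, stress, min_fold=2.0):
    """Return {spot_id: fold_change} for spots whose volume rose by at
    least min_fold in the stress gel relative to the control gel."""
    result = {}
    for spot_id, control_volume in control.items():
        stress_volume = stress.get(spot_id)
        # Skip spots missing from the stress gel or with no usable control value.
        if stress_volume is None or control_volume <= 0:
            continue
        fold = stress_volume / control_volume
        if fold >= min_fold:
            result[spot_id] = fold
    return result


# Invented spot volumes (arbitrary units), for illustration only.
control_gel = {"spot1": 100.0, "spot2": 50.0, "spot3": 80.0}
stress_gel = {"spot1": 250.0, "spot2": 55.0, "spot3": 40.0}

print(upregulated_spots(control_gel, stress_gel))  # {'spot1': 2.5}
```

Spots that pass a filter of this kind are the candidates that would be cut from the gel and digested in step 8.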
III. Using Mascot search software to identify proteins from mgf files
Trypsin digests proteins in a very specific and predictable manner – it cleaves the
peptide bond just after (C-terminal to) every lysine and arginine residue in a protein,
unless the next residue is a proline. This specificity is important, because it allows the
search software you will use (Mascot) to calculate how trypsin would digest every
protein sequence stored in bioinformatics databases. The basic trick of this protein
identification process is to compare peptide mass data produced by the mass
spectrometer with all of these “in silico” (that is, in the computer) digestions – if the
software is able to match the “real” peptides with a set of “in silico” peptides, then it
reports a match to the user.
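The "in silico" digestion described above can be sketched in a few lines of Python. This is a minimal illustration, not Mascot's actual algorithm: it assumes no missed cleavages and no chemical modifications, it compares whole-peptide masses only, and the example sequence, the observed masses, and the 0.02 Da tolerance are invented. Mascot itself also uses the MS/MS fragment ions and a probability-based score.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """Cleave after every K or R unless the next residue is P."""
    # Split at zero-width positions preceded by K/R and not followed by P
    # (splitting on empty matches requires Python 3.7 or later).
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide, in Da."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def match_masses(observed, peptides, tolerance=0.02):
    """Pair each observed mass with any theoretical peptide within tolerance."""
    matches = []
    for mass in observed:
        for pep in peptides:
            if abs(mass - peptide_mass(pep)) <= tolerance:
                matches.append((mass, pep))
    return matches


peptides = tryptic_digest("AKRPGKDER")      # invented example sequence
print(peptides)                              # ['AK', 'RPGK', 'DER']
print(match_masses([217.14, 418.18, 600.0], peptides))
```

Note that the arginine in "RPGK" is not cleaved because it is followed by a proline, which is the exception described in the text.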
Students in the yeast proteomics lab load their proteins of interest – digested by
trypsin – one at a time into the HPLC-MS/MS. The resulting output of the mass
spectrometer is a list of the masses of each of the peptides in the sample. (Actually, the
MS data are in the form of mass-to-charge ratios, m/z, but we can simplify and consider
the values to simply be peptide masses). This output is produced in the form of a
“mascot generic file,” or *.mgf.
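For readers curious about the simplification mentioned above, the relationship between a measured m/z value and the mass of the uncharged peptide is simple arithmetic, sketched here in Python. The sketch assumes positive-mode ions that carry their charge as added protons; the function name and the example values are illustrative.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Convert an observed m/z value to the mass of the uncharged peptide.

    An ion with `charge` extra protons has m/z = (M + charge * proton) / charge,
    so M = (m/z - proton) * charge.
    """
    return (mz - PROTON_MASS) * charge


# A doubly charged ion observed at m/z 500.0 (invented value).
print(round(neutral_mass(500.0, 2), 4))  # 997.9854
```

For singly charged ions the m/z value differs from the peptide mass only by the mass of one proton, which is why treating the values as masses is a reasonable simplification for this activity.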
The search software, Mascot, takes as its input the list of peptide masses and
intensities found in the mgf file. It also asks the user what bioinformatics database it
should use during its searches (for this activity, we will stick with a high-quality database
termed “UniProt,” which is maintained by a consortium that includes the European Bioinformatics Institute), as
well as what species is being studied. With this information, Mascot can use our mgf
inputs to attempt to identify the proteins the lab students isolated. This identification
procedure is the basis of the activity we have developed, and the step-by-step
procedure to use Mascot to identify proteins is described in an accompanying student
worksheet and guide.
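To make the structure of an mgf file concrete, here is a minimal Python sketch that reads one. An mgf file is plain text: each spectrum sits between "BEGIN IONS" and "END IONS" lines, with header lines such as TITLE, PEPMASS, and CHARGE written as KEY=value, followed by one "m/z intensity" pair per line. The function name and the sample spectrum are invented for illustration, and the sketch handles only this basic layout, not every option the format allows.

```python
def parse_mgf(text):
    """Return a list of spectra, each a dict with 'params' and 'peaks'."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line such as PEPMASS=499.8 (split on the first '=').
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                # Peak line: m/z followed by an optional intensity.
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((mz, intensity))
    return spectra


# Invented example spectrum; the numbers are not real data.
SAMPLE = """BEGIN IONS
TITLE=example spectrum 1
PEPMASS=499.8
CHARGE=2+
175.1 1200.0
304.2 850.5
END IONS
"""

for spectrum in parse_mgf(SAMPLE):
    print(spectrum["params"]["PEPMASS"], len(spectrum["peaks"]))  # 499.8 2
```

Opening one of the mgf files from this activity in a text editor should show the same overall pattern, repeated once for each spectrum the instrument recorded.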