Introductory Overview

Where do the mgf files come from? An overview of key concepts and techniques in proteomics.

Chris Eurich and Peter Fields

This document is intended to provide a brief introduction to proteomics, the collection of techniques that allows the detection and identification of many of the proteins expressed by cells and tissues. The text is modified from laboratory handouts used in an advanced biochemistry course at Franklin & Marshall College in Lancaster, PA, where upper-level undergraduates perform a semester-long independent proteomic study of the effects of environmental stresses on yeast. Mastery of the information presented here is not necessary to perform the exercises we have designed, but we provide it for those teachers and students who want to understand more fully how proteomics is accomplished, and what the steps are that ultimately produce the mgf files you will use to identify specific proteins. A more complete description of the protocols involved is available in a separate document, labeled as “advanced” on the website.

We begin with an introductory section defining and describing proteomics, and then list the steps that students in the yeast proteomics lab take to obtain protein samples for mass spectrometry analysis. We finish up with a description of the procedures that students should follow to complete the current activity, in order to identify and explore the “unknown” proteins found in the mgf files.

I. Proteomics – what is it?
II. The proteomic workflow
III. Using Mascot search software to identify proteins from mgf files

I. Proteomics – what is it?

Perhaps the best way to define proteomics is to draw an analogy with a more familiar branch of systems biology – genomics. As our ability to sequence DNA has increased, the complete genomes of many species have become available. Genomic studies make use of these complete gene sequences to describe the “blueprint” that controls the structure and function of organisms.
Genomic techniques allow researchers to look for novel genes, or to compare genes in different species to determine how natural selection has altered sequences.

Proteomics, instead of focusing on genes, attempts to provide a description of gene products, that is, the entire protein complement in the cell or tissue. This fundamental difference provides proteomic approaches with unique strengths, as well as significant weaknesses. Because, with few exceptions, proteins carry out the functions of a cell, a description of proteins present will provide the most accurate picture of how a cell grows, metabolizes, and interacts with its environment. In contrast to genomics, which tells us the proteins an organism potentially could produce, proteomics describes the proteins that are really there.

However, proteomics, at least in its current form, also has serious drawbacks. The most important is that the techniques currently in use do not allow us to capture information about all, or even a majority, of the proteins in a cell. If we are lucky, we will be able to detect perhaps 1000 – 1500 proteins, but a yeast cell probably expresses between 10 and 20 times as many different proteins. Thus, despite the potential of proteomics, we must keep in mind that our description is far from complete.

Ideally, the information available from genomics and proteomics, as well as other “omics” such as transcriptomics and metabolomics (the complete description of all metabolites present in a cell or tissue), can be combined to describe the function of the system as a whole – leading to “systems biology” as a way to understand the function of cells and organisms. This currently is beyond the capability of all but a few labs in the world. Nevertheless, biologists expect that this capability will grow, and will provide new insights into the function of cells, tissues and organisms that would not have been possible using older methods.

II. The proteomic workflow

Most proteomics projects aim to achieve two separate but related goals. The first is to quantify the amount of as many proteins as possible from the samples of interest (in our case, yeast that have been exposed to a particular stress). This process is accomplished by two-dimensional gel electrophoresis (2-DGE), followed by computer-based gel image analysis. The second goal is to identify some of the proteins we find – in the upper-level college biochemistry course that produced the data you will use, the proteins of greatest interest were those that changed in abundance in response to particular stresses. Protein identification is accomplished via tandem mass spectrometry (MS/MS) ions searching, a process that combines digestion of proteins into peptide fragments, measurement of the size of the fragments via mass spectrometry (MS), and comparison of fragment sizes with databases of known fragments in order to uniquely identify the target.

Here is the workflow that students in the lab used to accomplish these goals:

1) Growth of yeast under optimal or stress conditions – The students’ goal was to find proteins that increase in abundance (are “up-regulated”) in response to some environmental stress, such as heat or low oxygen, that the students were interested in studying.

2) Extraction of protein from samples – Students separated proteins from the other components of the yeast cell, attempting to retain as many of the proteins of the cell as possible as other macromolecules were discarded.

3) Measurement of protein concentration – Students needed to know the concentration of protein in their samples to load the appropriate amount on their gels.

4) First-dimension electrophoresis – To maximize the number of proteins we can see, we must spread our proteins across a gel in two dimensions.
The first dimension separates proteins according to their isoelectric points (pI’s); pI essentially is the pH at which any given protein has no net charge. The technique by which this is accomplished is termed isoelectric focusing (IEF).

5) Second-dimension electrophoresis – In the second dimension, proteins are separated by mass. This is accomplished with the commonly used SDS-polyacrylamide gel electrophoresis (SDS-PAGE) technique.

6) Staining and gel image capture – Once proteins were separated in two dimensions, students stained the gels using colloidal Coomassie blue. Images of the gels were made using a scanner.

7) Gel image analysis – In this step, students measured how many spots there were on each gel, and determined how much protein was present in each spot. By comparing gels from “control” yeast to “stress” yeast, students could find the proteins that changed in amount in response to the stress.

8) In-gel digestion of proteins – Once protein spots of interest were found, students digested the protein from some of these spots using trypsin, a protease found in the intestines of most vertebrates (including us). We must do this because the proteins are too large and complex to provide much information by themselves. By breaking them into smaller pieces, we can get more information from the mass spectrometer regarding the structure of the protein itself.

9) Mass spectrometry – We use a relatively advanced instrument – a high pressure liquid chromatography tandem mass spectrometer (HPLC-MS/MS) – to measure the masses of the peptides we created using trypsin. The mgf files you will use in this activity basically are lists of these peptide masses.

10) Bioinformatics / MS/MS ions searching – This final step of the proteomics workflow is what you will perform in this activity.
You will be able to use a search engine (Mascot) that takes the peptide fragment masses produced by undergraduate students who have completed the yeast proteomics lab, and uses them to identify the protein from which they came.

[Note that if you would like more detailed descriptions of the procedures listed in steps 1-9 above, they can be found in the companion file entitled “Proteomics – background and concepts (advanced).”]

III. Using Mascot search software to identify proteins from mgf files

Trypsin digests proteins in a very specific and predictable manner – it cleaves the peptide bond just after (C-terminal to) every lysine and arginine residue in a protein, unless the next residue is a proline. This specificity is important, because it allows the search software you will use (Mascot) to calculate how trypsin would digest every protein sequence stored in bioinformatics databases. The basic trick of this protein identification process is to compare peptide mass data produced by the mass spectrometer with all of these “in silico” (that is, in the computer) digestions – if the software is able to match the “real” peptides with a set of “in silico” peptides, then it reports a match to the user.

Students in the yeast proteomics lab load their proteins of interest – digested by trypsin – one at a time into the HPLC-MS/MS. The resulting output of the mass spectrometer is a list of the masses of each of the peptides in the sample. (Actually, the MS data are in the form of mass-to-charge ratios, m/z, but we can simplify and consider the values to be peptide masses.) This output is produced in the form of a “mascot generic file,” or *.mgf. The search software, Mascot, takes as its input the list of peptide masses and intensities found in the mgf file.
It also asks the user what bioinformatics database it should use during its searches (for this activity, we will stick with a high-quality database termed “UniProt,” which is maintained by a consortium that includes the European Bioinformatics Institute), as well as what species is being studied. With this information, Mascot can use our mgf inputs to attempt to identify the proteins the lab students isolated. This identification procedure is the basis of the activity we have developed, and the step-by-step procedure to use Mascot to identify proteins is described in an accompanying student worksheet and guide.
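
For readers who would like to see the “in silico” digestion idea from Section III in concrete form, here is a minimal Python sketch. It applies the trypsin rule described above (cut after every lysine, K, or arginine, R, unless the next residue is proline, P), then computes each peptide's monoisotopic mass and the m/z value it would show at a given charge. This is only an illustration under stated assumptions and is not how Mascot itself is implemented: the function names and the example sequence are hypothetical, the residue masses are standard rounded reference values, and real search engines also handle missed cleavages, chemical modifications, fragment ions and statistical scoring.

```python
# Monoisotopic masses (in daltons) of the 20 standard amino acid residues.
# Standard reference values, rounded to five decimal places.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water is added for the free N- and C-termini
PROTON = 1.00728   # mass of the proton that carries each positive charge


def tryptic_digest(sequence):
    """Cut after every K or R unless the next residue is P.

    Assumes complete digestion (no missed cleavages).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Whatever remains after the last cut is the C-terminal peptide.
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def mz(mass, charge):
    """Mass-to-charge ratio of a peptide carrying `charge` extra protons."""
    return (mass + charge * PROTON) / charge


if __name__ == "__main__":
    # Hypothetical sequence chosen for illustration; not a real yeast protein.
    example = "MKWVTFRPLLKAG"
    for peptide in tryptic_digest(example):
        mass = peptide_mass(peptide)
        print(peptide, round(mass, 4), round(mz(mass, 2), 4))
```

In the example sequence, the arginine is followed by a proline, so no cut is made there and the sequence yields three peptides rather than four. The mz function also shows why treating m/z values as peptide masses is a simplification: the same peptide appears at a different m/z for each charge state.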