Structural genomics of conserved gene families

By Ram Mani

Advisor: Gaetano T. Montelione, Ph.D.
Center for Advanced Biotechnology and Medicine; Department of Molecular Biology & Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ

In fulfillment of requirements for the Henry Rutgers Scholars Program
May 2000

Acknowledgements

I have been working in Dr. Gaetano Montelione's laboratory since May 1997, so the number of people to thank is very large. If I have left anyone out, please forgive me. Members of Dr. Montelione's lab have been tremendously supportive. Rong Xiao has taught me a great deal during my protein expression and purification work. Daphne Palacios has been an excellent teacher during much of the molecular biology work of my COG project. Dr. Kristin Gunsalus provided essential coordination during my work with COG's, and her protein expression kit has been a valuable resource. Several others (e.g. Drs. Albert Chien and Parag Sahasrabudhe) were always willing to answer my questions about DNA and protein preparation. Without the work of Alexandra Gardino and Charles Lu, my work on COG 0229S would not have been possible. Emily Ly was kind enough to share in my molecular biology work on COG 0316S. Gurmukh Sahota has been very gracious in helping me in times of need. Dr. Montelione and Dr. Stephen Anderson have been kind enough to serve on my thesis presentation committee and to deal with the delays that came with the composition of this thesis. Both have provided valuable assistance during my computational work. Last, but not least, I would like to recognize the overall direction that Dr. Montelione has provided me during all of my projects. His intelligence and creativity have been large motivating factors for my work.
Table of Contents

Chapter 1
Introduction
Experimental Methods
Computational Biology
Molecular Biology
Biochemistry
Results and Discussion
Conclusion and Future Work
References
Legends
Figures and Tables

Chapter 2
Abstract
Introduction
Experimental Methods
Results and Discussion
References
Table
Figure Legends
Figures

Preface to the thesis

This thesis was written as two chapters, each written independently of the other. Chapter one describes work that I have done during the investigation of three domain families of unknown function -- domains 0011, 0229, and 0316. Many of the gene products that populate these families are from predicted open reading frames from genome sequencing projects. The basis for these domain families was a gene product classification scheme called "clusters of orthologous groups of proteins (COG's)" (Tatusov, R.L., et al., 1997). What makes some of these domain families special is that they are conserved among several organisms of different phyla. For those families that are represented in both evolutionarily ancient and evolutionarily recent organisms, Darwinian theory allows us to assume that they may be fundamental to life. Computational biology, molecular biology, and biochemistry techniques are being used to choose targets intelligently within these domain families and ultimately to discover the biochemical functions of domains 0229 and 0316. We believe that by understanding the structure of each of these domains, we stand a very good chance of understanding their function.

The study described in chapter two partially justifies our structure-based approach to functional genomics. The study probes the question: if a structure were obtained for a protein of unknown function that has no homologues detectable via sequence comparison, would structural comparison to the Protein Data Bank structures result in useful functional insight?
By simulating this scenario, we see that structural comparison provides useful clues when sequence comparison fails for eight out of ten randomly chosen cases. This result gives more reason to regard the work of chapter one as one of the foundations of a structural genomics approach with significant potential.

Chapter 1: Structural genomics of conserved gene families.

Introduction

At a March 1999 press conference, Vice President Al Gore said:

"I am extremely pleased that the Human Genome Project (HGP) has accelerated efforts to complete one of the most important scientific projects in human history -- unlocking the secrets of the genetic code. The Project will forever change how we understand the human body and disease, leading to improved prevention, treatments, and cures for what are currently medical mysteries" (http://www.ornl.gov/hgmis/project/update.html).

Gore revealed a fundamental point in his statement: the tremendous scientific and medical benefits that we will gain from the genome sequencing projects of humans and other organisms rest upon how we choose to understand the data obtained from these projects (Tilghman, S.M., 1996). To this end, our lab has chosen to contribute to the interpretation of these data by using nuclear magnetic resonance (NMR) spectroscopy to determine the structures of proteins coded by genes shared by different organisms. Target selection and structure determination applied across one or more genomes represents a new area of science called structural genomics (proteomics). Structural genomics (Montelione, G.T. & Anderson, 1999) is the field concerned with determining the three-dimensional structures of proteins coded by genomes and using this structural insight to understand the biochemical and cellular functions of these proteins, i.e. functional genomics (Feng, W., et al., 1998; Hwang, K.Y., et al., 1999).
However, with this "genome approach" to structure determination, one can easily be overwhelmed by the tremendous number of potential protein targets. The human genome alone is estimated to contain eighty thousand to one hundred thousand genes, and the process of determining a protein structure by NMR or X-ray crystallography traditionally takes months to years. We have devised a set of criteria that may lead to the discovery of protein structures of significant impact on the structural biology community. Specifically, we consider:

gene products that show significant sequence similarity (homology) to gene products of different organisms, with the organisms being evolutionarily diverse and some being Metazoan (e.g. H. sapiens, M. musculus, C. elegans);
gene products that are of suitable size and predicted solubility for NMR spectroscopy; and
gene products that have no known biochemical/cellular function.

The first criterion suggests that the gene may be fundamental for life, since it is found in a majority of the different kingdoms of life, no matter how primitive or sophisticated. Another benefit of the first criterion is that a "family of homologues" can be investigated in parallel; thus, if one protein shows poor biophysical properties, work on the family can continue because "cousin" proteins may provide better results.

An excellent foundation for our investigation is clusters of orthologous groups of proteins (COG's). This is a classification scheme (Tatusov, R.L., et al., 1997) of homologous genes from an evolutionarily diverse set of organisms whose genomes were completely sequenced. A COG is defined as a set of 3 or more evolutionarily diverse best hits from genome versus genome sequence comparisons. A fundamental assumption is that each COG represents a structural domain family.
A structural domain is a piece of a protein that folds independently and has been proposed to be a protein module whose DNA moves throughout the course of evolution (Patthy, L., 1996). Because many domains retain stability without their parent proteins, many structural studies examine a protein one isolated domain at a time. NMR structure studies are often focused on single structural domains rather than multi-domain proteins because of the 30 kDa size limitation for NMR data interpretation.

In addition to intelligent target selection, the structural genomics effort depends upon rapid and high-quality structure determination methods. X-ray crystallography and NMR spectroscopy are two complementary methods. X-ray crystallography uses X-ray diffraction from crystallized macromolecules to determine the structures of those molecules. Proteins in the range of several hundred kilodaltons molecular weight can be examined through X-ray crystallography. Although crystallization of these macromolecules is a major bottleneck, burgeoning techniques and technology are making the rapid acquisition of high-quality atomic structures realistic for X-ray crystallography (Terwilliger, et al., 1998).

NMR complements X-ray crystallography by allowing the investigation of molecules in their native, solution state. NMR also allows structural investigation of some of those proteins that do not crystallize. Rapid acquisition of high-quality atomic structures is closer to reality due to a multitude of developments in technique and technology (Montelione & Anderson, 1999). The bulk of the data for the structure will come from triple resonance experiments. Software (SPARKY: http://www.cgl.ucsf.edu/home/sparky/) from a collaborating group allows for rapid acquisition of peaks from NMR spectra. In addition, our lab has developed software to automate the derivation and analysis of resonance assignments (Montelione, G.T., et al., 1999).
With high-resolution structures, we will be in a better position to test the biochemical/cellular function of domains of unknown function. Even though the domains being investigated show no sequence similarity to proteins of known function, any similarity between their structures and protein structures of known function can act as a guide in testing function. This may be possible because structure is preserved much better than sequence throughout the course of evolution (Holm, L. & Sander, C., 1996; Levitt, M. & Gerstein, M., 1998; Holm, L. & Sander, C., 1997; Lima, C.D., et al., 1997).

The benefits from this project can be many. We can begin to understand the biochemical/cellular function of conserved domains. If the structures of these domains are dissimilar to anything known -- the probability of this is enhanced by the lack of sequence similarity to anything of known structure -- they may serve as new members of the collection of known protein folds existing in nature. The members of this collection, which is estimated to comprise some one thousand to two thousand unique folds, can be used as structural and functional models for the many sequences coming from the genome projects (Chothia, C., 1992). This will assist in providing novel templates for homology-modeling projects. In addition, because many of these COG domains are present in so many different organisms, knowledge gained about them will benefit evolutionary studies. The presence of these domains in model organisms can be exploited to elucidate their functional roles through experiments designed for those systems (e.g. RNA interference in C. elegans).

With this in mind, work on two of three domain families has supplied us with candidates for structure determination projects. COG 0011S showed no representation in Metazoa, and therefore work on it was stopped after the computational biology stage.
Protein purification and biophysical studies for COG 0229S will be examined in this thesis. Experiments show this domain to be a candidate for structure determination via NMR. Computational biology, molecular biology, protein expression, and protein purification studies for COG 0316S will also be examined. Experiments show that this domain may not be suitable for NMR, but may instead be a candidate for X-ray crystallography.

Experimental Methods

Computational Biology

A list of COG's of potential suitability for NMR was provided to the lab (Eugene Koonin, NIH). These 15 COG's contain conserved regions 49 to 382 residues in length (the majority are less than 200) and possess no predicted transmembrane regions. They were of the R or S classes -- function not well understood or function unknown, respectively. Two of these COG's (0011S and 0316S) were subjected to a general computational biology protocol (Figure 1). These two COG's were among the first analyzed and expanded by the lab. They were chosen first because of their apparently short domain lengths. A step-by-step listing of the protocol is found at: http://www-nmr.cabm.rutgers.edu/bioinformatics/cogs/protocol/protocol2.html

Since this work, members of this lab have developed a more effective and efficient protocol.

Acquisition and Initial Analysis of Original Sequences in COG

COG protein sequences were obtained from the NCBI site (http://www.ncbi.nlm.nih.gov/COG/) and were analyzed for regions of conservation and potential domain ends, i.e. the beginning or end of a conserved region. This was done by creating a multiple sequence alignment (msa) of all the protein sequences listed under the COG. The program CLUSTALW (Thompson J.D., et al., 1994), as provided in the software suite NCSA Biology Workbench version 3.0 of the University of Illinois (http://biology.ncsa.uiuc.edu/), was used to generate the msa.
To aid in visualization of conserved residues and regions within the msa, the program Boxshade 3.3.1 (Kay Hofmann and Michael D. Baron) in the same software suite was used to color conserved regions and to derive a consensus sequence for the msa.

Gathering of Possible Structural & Functional Information of COG

The consensus sequence derived from the msa was used to search the Protein Data Bank for structural homologues. The PSI-BLAST online program (Altschul, F., et al., 1997) was used to search NCBI's local PDB sequence database (E score < 10, low-complexity regions left unfiltered). No structural homologues were found. In addition, the consensus sequence and some of the original sequences were checked for transmembrane region features. No evidence of transmembrane regions was found for the COG 0316S proteins.

Likewise, a search was conducted for functional implications of the COG proteins. Primary reliance was placed upon the Swiss-Prot database (Bairoch, A., 2000 -- http://www.expasy.ch/sprot/) because it is one of the better-annotated protein databases. When applicable, Medline abstracts and articles were consulted to discern the function of proteins.

Expansion of COG's

After the initial analysis, the next step was to see if the COG domain was represented in organisms whose genomes were not included in the COG scheme of Tatusov and colleagues, i.e. genomes that are not completely sequenced. It should be noted that this "expansion" of the COG actually relaxes the restrictions of the NCBI scheme. Tatusov and colleagues used the "best hit among completely sequenced genomes" criterion to ensure that they were creating a "tightly-knit" group of proteins that were homologous to each other. By comparing completely sequenced genomes, one can distinguish orthologues from paralogues. Since genomes not completely sequenced were examined in our expansion, we cannot be sure whether orthologues or paralogues were added to the expanded COG.
This uncertainty has implications for conservation of function. However, our primary concern was to develop a more populated domain family. Our basis for adding homologous proteins containing a domain common to the COG was the following set of criteria: (1) statistical significance of similarity between sequences (E score less than or equal to ~10^-2); (2) good alignment between sequences for the region assumed to constitute the domain; (3) the sequence being a neighbor rather than an outlier in the phylogenetic tree of the COG; and (4) the sequence possessing identical or similar residues at positions that were well conserved in the msa.

To search for these possible additions to the COG, the program HMMER (Sean Eddy, http://hmmer.wustl.edu/) was used to search the nonredundant (nr) protein database of NCBI. HMMER works by creating a hidden Markov model, a statistical representation of an input multiple sequence alignment. This profile is then compared to each of the sequences in the queried database to look for possible matches. In addition to this search, PRODOM (Corpet, F., et al., 1998) and Prosite (Hoffman, K., et al., 1999) were searched to see if additional proteins could be added to the domain family. These homologues were also inspected for function by consulting the Swiss-Prot database and the relevant literature.

The nonredundant protein database lacks translated expressed sequence tag (EST) sequences. Currently, the EST "shotgun" approach to sequencing is being used to sequence many of the genomes of eukaryotes. Thus, to consider many of the homologous proteins from higher organisms, one needs to search EST's. Since HMMER cannot be used to search nucleic acid databases, BLAST was used to search the NCBI database of EST's (dbest). The consensus sequence from the most recent msa was used as the query sequence, which captures some of the additional conservation information of the expanded COG rather than only that of the original COG.
A BLAST module that compares a protein sequence to a database of nucleic acid sequences translated into the six possible gene products was used as the search program. EST's with statistically significant similarity to the query (E score < 10^-2) were searched against the NCBI nr and EST databases to see if overlapping nucleic acid sequences existed. This was done because EST's are often not full length and may contain sequencing errors at relatively high frequency. In addition, the Unigene database (Schuler, et al., 1996) of NCBI was consulted because it contains updated groups of overlapping EST's in the public domain for mouse and human. After all possible overlapping sequences were acquired, the assembler CAP (Huang, X., 1996) at the Baylor College of Medicine (BCM) website was used to build a more accurate and longer cDNA sequence. This assembled "contig" was then translated into the six possible gene products (with attention paid to stop codons) using the BCM 6-frame translator (http://dot.imgen.bcm.tmc.edu:9331/seq-util/seq-util.html). The resulting protein sequences were compared to the msa consensus to determine which translation might code for a possible homologue. These translations were accepted or rejected using criteria similar to those explained previously.

Choosing to express domain

With more information now available about this domain, we were in a better position to decide whether it should undergo a structure determination project. It was decided to pursue domains that were: conserved in a large number of phylogenetically diverse organisms, found in Metazoa, and most appropriate for NMR (small size and no predicted coiled-coils). COG 0011S was excluded from further investigation because of its absence in Metazoa. COG 0316S fulfilled the previously listed criteria. Its domain termini were reevaluated by looking at the more highly populated msa to better determine which residues may be more critical for structural and/or functional integrity.
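The six-frame translation of an assembled contig described above can be illustrated with a short sketch. This is not the BCM translator used in the thesis; it is a minimal illustration that assumes the standard genetic code, marks stop codons with "*", marks codons containing ambiguous bases with "X", and uses hypothetical function names and a toy input sequence.

```python
# Minimal six-frame translation sketch (assumes the standard genetic code).
# Function names and the example sequence are illustrative only.

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, template in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six strings could then be compared to the msa consensus; frames interrupted by many "*" characters are unlikely to encode the homologue.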
Following the philosophy that several of the member domain sequences from the COG should be expressed in case some of the proteins prove problematic during our investigation, six domain sequences were chosen for expression. This selection was made by inspecting the phylogenetic tree for "subgroups" of the domain sequences. The tree is a graphical representation of a distance matrix comparing the evolutionary differences between pairs of sequences in the msa. The distance between sequences in the tree represents how evolutionarily different those sequences are; the nodes of the tree indicate points of divergence between sequences. Assuming that sequences within a branch are most similar to each other, we subgrouped sequences according to the major branch on which they reside. Attempts were made to choose at least one sequence from each major branch of the tree. Selection of sequences within a subgroup was based on availability of the DNA coding for that sequence, confidence in the quality and length of the reported cDNA coding for that sequence, and preferences of the scientific community and/or the lab for study of the organism from which the DNA originates.

Once the sequences to express were chosen, the domain termini were finalized. Previous predictions of domain termini and the revised msa were consulted. In addition, a secondary structure prediction of the sequences to express was conducted using the program PHDsec (Rost, B. & Sander, C., 1993, 1994) in the online software suite PredictProtein (Rost, B., 1996). Domain termini were revised so as not to interrupt predicted secondary structure elements.

Molecular Biology & Biochemistry ("Benchwork")

The next phase of the project was to express, purify, and analyze several of the domain sequences for COG's. As explained before, it was decided to discontinue work on COG 0011S and to continue work on COG 0316S.
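The distance matrix underlying the phylogenetic tree described above can be sketched as follows. The thesis does not state which distance measure was used, so this example assumes the simplest one, the p-distance (fraction of differing residues over aligned, ungapped columns). The function names and the three toy aligned sequences are hypothetical.

```python
# Pairwise p-distance matrix from a multiple sequence alignment.
# Assumption: p-distance is used purely for illustration; the thesis does
# not specify the measure behind its tree.


def p_distance(a, b):
    """Fraction of mismatches over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")  # no comparable columns
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every ordered pair of sequences."""
    names = list(alignment)
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m in names
        for n in names
    }


if __name__ == "__main__":
    toy_alignment = {  # hypothetical aligned fragments, equal length
        "seqA": "LTLT-AVEKL",
        "seqB": "LTIT-AVEKL",
        "seqC": "IRLTEAAERL",
    }
    for pair, dist in distance_matrix(toy_alignment).items():
        print(pair, round(dist, 3))
```

Sequences separated by small distances would fall on the same major branch, so picking one representative per cluster of small distances mirrors the subgrouping described in the text.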
Thus, the molecular biology protocol (Figure 2) will refer to steps taken in the preparation of expression vectors for the 316 domain sequences.

DNA Acquisition & Preparation

After determining which sequences to express from the 316 domain family, DNA coding for the protein sequences was obtained. Genomic DNA (gDNA) for the B. subtilis, H. influenzae, and S. cerevisiae sequences and cDNA for the C. elegans, M. musculus, and H. sapiens sequences were obtained (Table 1). The gDNA had already been isolated, and it was provided to us by neighboring labs (S. Anderson and S. Brill). cDNA was purchased from the vendor Genome Systems or acquired from institutions (Dr. Yuji Kohara of Japan's National Institute of Genetics). Since all cDNA was supported only by EST data, it was sequenced upon receipt.

DNA Acquisition & Preparation: 316 C. elegans (CEle1)

Isolating the cDNA for the C. elegans coding sequence required excision of the pBluescript SK (-) phagemid from the lambda ZAP vector. The methodology for this process -- which makes use of the ExAssist helper phage -- is described in the instruction manual for the Lambda ZAP II Library (Stratagene). Once the pBluescript vector was isolated, the C. elegans cDNA insert contained in it was sequenced using the T3 and T7 primers that border it. The sequence was verified against the contig assembled from the EST's. This sequencing demonstrated that the insert contained coding sequence whose translated product contained the expected stop codon (i.e. it was complete in the portion coding for the C-terminus of the complete protein). The sequencing also demonstrated that the insert contained coding sequence upstream of that reported in the GenBank EST's. In other words, additional residues in the N-terminal portion of the protein were revealed. However, no methionines could be found among these residues.
This leads to the conclusion that, although more of the N-terminal portion of the protein was revealed through our cDNA sequencing, the cDNA is probably too short to encode the entire coding sequence. Recently (after the creation of the expression vectors), a predicted gene product almost identical to the 316 CEle1 sequence was deposited in the WormPep database (Sanger Center & Washington University C. elegans genome project database). This sequence contains a methionine residue at the N-terminus. It also contains about 25 residues between this N-terminal methionine and the beginning of the region of conservation for the 316 domain sequence (i.e. LTLT…). No signal peptide sequence could be found when running the C. elegans WormPep sequence through a signal peptide search program (PredictProtein). Thus, our choice for the beginning of the CEle1 domain may have been off by 20 residues if the structural domain begins at the immediate N-terminus of the complete gene product.

DNA Acquisition & Preparation: H. sapiens (HSap2)

The 316 HSap2 cDNA arrived in the pT7T3D-Pac vector (Pharmacia) contained in DH10B host bacteria, supplied as an agar stab. The bacteria were plated and grown overnight in LB+amp media, and the high copy plasmid was isolated by the Qiagen Miniprep method. Sequencing was performed using T3 and T7 primers. The resulting sequence contained a coding sequence similar to the contig assembled from the EST's. Errors in the contig sequence were corrected by consulting sequencing waveforms for both strands of the insert. The region of high-quality sequence of the insert was translated into a gene product that: (1) was very similar to the 316 domain sequence, (2) contained a stop codon at the expected position, and (3) did not contain a possible initiator methionine, but did have at least forty residues N-terminal to the portion of the protein (IRLT…) that begins the region of high conservation for the 316 domain sequence.
DNA Acquisition & Preparation: M. musculus (MMus2)

Progress on the MMus2 sequence was limited. It was ordered from Genome Systems in the same way as HSap2. However, upon receipt in the lab, the bacterial agar stab was stored at -20°C. It should have been stored at +4°C or plated immediately. The freezing temperature killed the bacteria in the agar. A new clone was sent for sequencing after plating, growing the host cells overnight, amplifying and purifying the plasmid containing the cDNA, and preparing a sequencing mixture (primers and template). The sequence showed that this clone contained cDNA corresponding to a different gene; Genome Systems had sent a clone whose identification number differed by one digit from that of the correct clone. After lengthy dialogue with Genome Systems, the company finally sent the correct clone. By this time, the project was well into the molecular biology stage. Thus, it was decided to postpone work on this domain sequence.

Note on sequencing

The DNA Synthesis and Sequencing Laboratory of UMDNJ-RWJMS conducted all DNA sequencing for this project. They employ a dye-terminator PCR method of sequencing. Sequencing readouts and waveforms were consulted as necessary. A waveform tracing is a plot of the intensity of the fluorescence associated with the dye on each nucleotide type versus the position of that nucleotide on the strand of DNA being sequenced.

Design of PCR primers

After the correct cDNA sequence had been established, I was in a position to design PCR primers to clone the desired coding sequence. While designing the PCR primers, it was also assumed that the gDNA for SCer2, BSub1, and HInf1 matched that in the GenBank database. Primers were designed such that we could extract a coding sequence that would retain the ability to be ligated into a multiplexed expression vector system developed by Dr. Kristin Gunsalus and colleagues (Gunsalus, K.C., et al., submitted).
Gunsalus' expression system has the goal of creating a single PCR product that may undergo different sets of digests so that it may be ligated into nine expression plasmids (Figure 3). The plasmids differ in the location of a hexa-histidine fusion tag that is exploited for purification purposes. They also differ in the type of promoter employed during transcription of the coding sequence, so that the yield of recombinant protein can be optimized. Specific restriction sites needed to be designed into the 5' and 3' ends of the coding sequence for each domain sequence so that each PCR product could be inserted into each of the nine vectors.

During the course of the investigation, it was decided that 2 types of PCR product would be made for each domain sequence -- one containing the RE2 site and one excluding the RE2 site. This was done so that the domain sequences expressed as N-terminal hexa-histidine tag or nonfusion constructs would retain their native sequences in the C-terminal portion of the domain sequence. Thus, one forward primer and two reverse primers were designed for each domain. The following describes the naming scheme for PCR products: name-1 for the PCR product coding for the C-terminal hexa-histidine fusion tag domain sequences; name-2 for the PCR product coding for the N-terminal hexa-histidine fusion tag and the nonfusion domain sequences. For example, BSub1-2 refers to the PCR product coding for the N-terminal hexa-histidine fusion tag and the nonfusion domain sequences of the BSub1 domain from COG 0316S.

In addition to the proper restriction sites being designed into each PCR primer:
1) Two stop codons were designed into each reverse primer to protect against read-through by E. coli's translational machinery.
2) Primers were made long enough and with high enough GC content so that proper, specific annealing could occur. Melting temperatures were in the range of 50 to 70°C.
3) Rare codons in the first approximately ten amino acids of the domain sequence were "designed out" in favor of their more prevalent synonymous codons. This was done to prevent disruptions in the translational machinery, which may affect the integrity or yield of protein. When these changes in primer versus template sequence occurred, approximately 10 additional nucleotides were incorporated into the downstream sequence to ensure proper annealing of the overall primer to the template.
4) The primer was extended to include a "CC", "GG", "CG", or "GC" clamp at its 3' end. This was done to ensure proper annealing at this critical area. Secondary structure formation in this area could cause mutations (of the addition type) in the PCR product.
5) Primer-check software at the Whitestone, Inc website was used to check the primer sequences for secondary structure, melting temperature, and primer-dimer formation.
6) The DNA coding for the domain sequences was checked for restriction sites commonly used in digesting products for ligation into vectors of the Gunsalus expression kit.

In the case of the HSap2 domain sequence, only one PCR product was made because the restriction site (RE2) for the C-terminal hexa-histidine fusion tag accommodates the protein's native sequence. BSub1, HInf1, and SCer2 also underwent primer design as described above.

PCR and PCR Product Purification/Concentration

Primers were ordered from GenoSys or the RWJMS DNA Synthesis/Sequencing Laboratory. Two separate PCR reactions (except for HSap2) were conducted for each 316 domain member to be expressed. The 100 ul PCR mixture consisted of: 1 ul of template DNA (20 ng/ul), 1 ul of each primer (10 uM concentration), 10 ul of 10X PCR Taq buffer, 8 ul of dNTP's, 1 ul of Taq Polymerase, and 78 ul of dH2O.
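Several of the primer design checks listed above (GC content, melting temperature, 3' clamp) can be sketched in a few lines. This is not the Whitestone primer-check software used in the thesis, whose method is not described. As an assumption, melting temperature is estimated with the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough guide for short oligonucleotides; the function names and the example primer are hypothetical.

```python
# Rough primer sanity checks. Assumptions: Wallace-rule Tm estimate and a
# hypothetical example primer; neither comes from the thesis itself.


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer):
    """Approximate melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def has_gc_clamp(primer):
    """True if the last two 3' bases are CC, GG, CG, or GC."""
    return primer.upper()[-2:] in {"CC", "GG", "CG", "GC"}


if __name__ == "__main__":
    example_primer = "ATGGCTAGCAAAGGAGAAGC"  # hypothetical 20-mer
    print("GC content:", gc_content(example_primer))
    print("Wallace Tm (deg C):", wallace_tm(example_primer))
    print("3' clamp present:", has_gc_clamp(example_primer))
    print("Tm within 50-70 deg C:", 50 <= wallace_tm(example_primer) <= 70)
```

Dedicated primer-analysis tools use nearest-neighbor thermodynamics and also screen for hairpins and primer-dimers, which this sketch does not attempt.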
PCR was conducted in the following sequence: 10 cycles of 94°C for 1 minute, 45°C for 1 minute, and 72°C for 2 minutes; 25 cycles of 94°C for 1 minute, 60°C for 2 minutes, and 72°C for 5 minutes; and finally, 72°C for 5 minutes. The PCR mixture underwent electrophoresis on a 0.8% agarose gel, and the appropriate bands were excised from the gel. The PCR product was purified according to the QIAEX II Agarose Gel Extraction protocol or the QIAQuick Gel Extraction protocol (Qiagen). The concentration of the product was determined by UV visualization on a gel and comparison against standards. If it was too low, the PelletPaint protocol (Novagen) was used to concentrate the purified PCR product.

Ligation, Amplification, and Purification of PCR product

To amplify the PCR product and to allow more efficient digestion of the insert using the proper restriction enzymes, the PCR product was ligated into a high copy plasmid. A pBluescript plasmid with overhanging T's was the first choice, and a DNA Rapid Ligation kit and protocol was used. Considerable time (1-2 months) was spent trying to ligate the PCR product into this vector. The insert to plasmid molar ratio was increased in successive reactions from 3:1 to 6:1 to 9:1 and finally to 12:1. New PCR product was synthesized. Fresh Taq Precision Polymerase Plus enzyme and ligation mixture were used. Newly isolated pBluescript phagemid was also used. However, none of the four lab members who attempted the reaction was able to generate a successful ligation. Controlled tests could not determine the cause of failure.

Consequently, a kit from Invitrogen that allows ligation of PCR products into a stable, high copy plasmid (pCRII-TOPO) and transformation of that plasmid into TOP10 cells was used. Instructions are found in the Invitrogen manual. By use of colony PCR and/or restriction digest analysis, it was confirmed that all PCR products (2 for each of the 5 organisms except H. sapiens, which had one) were ligated into the pCRII-TOPO plasmid.
Each of the cloned inserts was sequenced to guard against errors in the eventual expressed protein. M13 and T7 primers were used because they border the multiple cloning site of the pCRII-TOPO holding/amplification vector. The sequencing results showed point mutations in several of the inserts in the portion coded by the primers and no errors elsewhere in the inserts. Table 2 shows examples of errors that were found through sequencing of both strands of the ligated PCR product. There seems to be no pattern in the errors that would lend support to a sound explanation. It was hypothesized that a chemical modification (e.g. deamination) of a purine would tend to make it appear to the DNA polymerase more like the other purine (with a similar explanation given for pyrimidine mutations). Consistent results supporting this hypothesis were not evident. In some cases, when the same primer was used in two different PCR reactions, different errors would occur in each of the reaction's PCR products. These error-ridden PCR products resulted both from reactions in which primers from GenoSys were used and from reactions in which primers from the RWJMS DNA Synthesis Lab were used. After redoing some of the reactions and observing that HSap2, CEle1-1, CEle1-2, and BSub1-1 were the only PCR products of acceptable sequence, it was decided to focus on creation of expression vectors corresponding to these PCR products.

The PCR products in the holding/amplification vector were allowed to amplify in E. coli overnight at 37°C in 100 ml of LB culture. The Qiagen Midiprep protocol was used to isolate and purify plasmid at the hundreds of ng/ul level. Concentrations were estimated by applying the relationship that 1 absorbance unit at 260 nm corresponds to 50 ug/ml of DNA. Concentrations were also verified by running diluted plasmid samples on agarose gels and comparing the fluorescence of ethidium bromide-stained DNA to samples of known concentration.
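The absorbance-based estimate above can be written as a one-line conversion. This is a minimal sketch assuming a 1 cm path length and double-stranded DNA; the function name and the example reading are illustrative, not taken from the lab's records.

```python
# Standard conversion for double-stranded DNA: 1 A260 unit ~ 50 ug/ml.
DSDNA_UG_PER_ML_PER_A260 = 50.0


def dna_concentration_ng_per_ul(a260, dilution_factor=1.0, path_length_cm=1.0):
    """Estimate dsDNA concentration from an A260 reading.

    ug/ml is numerically equal to ng/ul, so the result can be read either way.
    `dilution_factor` is the fold-dilution of the sample that was measured.
    """
    if a260 < 0 or dilution_factor <= 0 or path_length_cm <= 0:
        raise ValueError("invalid reading or parameters")
    return a260 / path_length_cm * DSDNA_UG_PER_ML_PER_A260 * dilution_factor


if __name__ == "__main__":
    # Hypothetical reading: A260 = 0.1 on a 50-fold diluted plasmid prep.
    print(dna_concentration_ng_per_ul(0.1, dilution_factor=50), "ng/ul")
```

A reading in the hypothetical example above corresponds to a stock in the hundreds of ng/ul range, the level reported for the Midiprep plasmid.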
Creation of expression vectors

Plasmids containing inserts were digested using the appropriate restriction enzymes so that the inserts could be ligated into the selected expression vectors. 60 ul restriction digest reactions were conducted as follows:

4-6 ug of DNA                  x ul
Restriction enzyme 1           3 ul
Restriction enzyme 2 or 3      3 ul
Enzyme buffer + 10X BSA        6 ul
dH2O                           60 - x ul
Total reaction mixture         60 ul

Reactions were conducted at 37°C for approximately 5 hours. The reaction mixture was run on 0.8% agarose gels, the proper band was excised, and the newly digested insert was purified using the QIAQuick Gel Extraction kit and protocol. The concentration of purified inserts was tested, and inserts were concentrated by using the methods described above in "PCR and PCR Product Purification/Concentration." Expression vectors from the Gunsalus expression kit were also digested, purified, and quantified using the same methods as when preparing the inserts.

Ligation reactions were conducted using the DNA Rapid Ligation kit and protocol introduced in the "Ligation, Amplification, and Purification of PCR product" section above. Specifically, a 21 ul reaction mixture was made consisting of:

DNA (100 ng) **                x ul
DNA dilution buffer (5x)       10 - x ul
Ligation buffer (2x)           10 ul
Ligase                         1 ul
Total reaction mixture         21 ul

** DNA consists of vector and insert at 3:1 or greater insert:vector molar ratio.

The reaction was run at room temperature for thirty minutes to four hours depending on the success of previous reactions. After the ligation reaction was assumed to be complete, 1 ul of the reaction mixture was pipetted and gently stirred into a tube of 10 ul of NovaBlue competent cells (Novagen). Transformation was attempted according to the Novagen transformation protocol. If colonies were present on plates after the overnight incubation of transformed cells, at least 4 colonies were chosen to undergo analysis for completed expression vector.
Theoretically, cells that grew on the plates (with the appropriate antibiotic resistance) should contain the insert unless the plasmid was already in circular form. Cells were lysed and underwent colony PCR to determine if the insert was contained in the plasmid. If results were inconclusive, plasmid was isolated and examined by restriction digest analysis to determine if the insert was present. If necessary, the plasmid presumed to have the insert was also run on a gel and compared in size against an undigested control plasmid of the same type but without insert. For creation of some expression vectors, success was immediate (e.g. ligation after the first reaction). For reactions in which no colonies were produced after transformation or for reactions in which cells did not contain completed expression vector, the ligation and transformation procedure was repeated. DNA amounts were scaled up, the insert:vector molar ratio was increased, and the incubation time for ligation was increased. Ultimately, all but one of the expression vectors were created for PCR products of acceptable sequence (Table 3).

Biochemistry

The following describes all general and some specific methods used in protein expression, purification, and analysis. These methods were applied to domain sequences from the COG 0316S and 0229S families.

SDS-PAGE

17.5% SDS-PAGE gels were poured according to the following recipe. These 10 or 15-lane gels were used as necessary for all experiments described in this paper. The following recipe provides for 12 gels:

                               Bottom gel (17.5% SDS-PAGE)   Stacking gel (4%)
40% acrylamide mix             21.88 ml                      2.0 ml
1.5 M Tris-HCl (pH 8.8)        12.50 ml                      0
1.0 M Tris-HCl (pH 6.8)        0 ml                          2.5 ml
dH2O                           14.6 ml                       15.1 ml
10% SDS                        0.5 ml                        200 ul
10% APS                        0.5 ml                        200 ul
TEMED                          20 ul                         15 ul
Total                          50 ml                         20 ml

Protein samples were mixed with a 2x SDS-loading dye and B-ME mixture. This mixture was boiled for 5 minutes. Samples were run at 150 mV in an SDS running buffer.
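The acrylamide percentages quoted for the gel recipe above follow directly from the dilution of the 40% stock. The sketch below recomputes them from the listed volumes; the function names are illustrative.

```python
# Volumes in ml, transcribed from the recipe above (20 ul = 0.02 ml, etc.).
BOTTOM_GEL_ML = {
    "40% acrylamide mix": 21.88,
    "1.5 M Tris-HCl (pH 8.8)": 12.50,
    "dH2O": 14.6,
    "10% SDS": 0.5,
    "10% APS": 0.5,
    "TEMED": 0.020,
}
STACKING_GEL_ML = {
    "40% acrylamide mix": 2.0,
    "1.0 M Tris-HCl (pH 6.8)": 2.5,
    "dH2O": 15.1,
    "10% SDS": 0.2,
    "10% APS": 0.2,
    "TEMED": 0.015,
}


def final_percent(stock_percent, stock_ml, total_ml):
    """Final percentage after diluting `stock_ml` of stock to `total_ml`."""
    return stock_percent * stock_ml / total_ml


def stock_volume_ml(target_percent, stock_percent, total_ml):
    """Volume of stock needed to reach `target_percent` in `total_ml`."""
    return target_percent * total_ml / stock_percent


if __name__ == "__main__":
    print("bottom gel %:", final_percent(40, BOTTOM_GEL_ML["40% acrylamide mix"], 50))
    print("stacking gel %:", final_percent(40, STACKING_GEL_ML["40% acrylamide mix"], 20))
    print("bottom gel total (ml):", sum(BOTTOM_GEL_ML.values()))
    print("stacking gel total (ml):", sum(STACKING_GEL_ML.values()))
```

The same two functions can be reused to rescale the recipe for a different number of gels or a different target percentage.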
A BIORAD "Prestained SDS-PAGE Standards Broad Range" ladder was always run in one lane of each gel to estimate the molecular weight of bands. Often, lysozyme standards of known concentration were run on gels to estimate the quantity of protein in each band. After electrophoresis, gels were washed twice for 5 minutes each in 50 ml of fresh dH2O. Gels were then stained for 10 or more hours using 20 ml of Coomassie Blue G-250 (Pierce). After the staining period, gels were destained in 50 ml of dH2O for six or more hours to improve resolution, contrast, and sensitivity. Gels were scanned by using the Montelione lab scanner and Adobe PhotoShop. From March 2000 onwards, gels were scanned by using the CABM 2nd floor gel photo-imaging equipment, which gave much better image quality than the scanner.

Testing total protein expression (small-scale) via boil/chill cell lysis

To test for the total amount of target protein expressed in E. coli in various conditions, cell lysate was analyzed on SDS-PAGE gels. Generally the procedure involved freshly transforming the expression vector into the cell line (Novagen protocol). Cells were plated and incubated overnight at 37°C. Multiple colonies were picked from plates, and each was grown in 4 ml of LB or minimal media (supplemented with 2% glucose if the vector contained a T7 promoter) with the appropriate antibiotic. This overnight bacterial culture was used to inoculate 4 ml of fresh media, which was incubated at 37°C, shaken, and monitored for OD600 by using the spectrophotometers of the Montelione or Anderson labs. When an OD600 of 0.5 to 0.8 was reached, 1 or 2 mM IPTG was added to induce expression. Cells were allowed to incubate at 37°C for 3 hrs. if in LB (5 hours if in minimal media) or at 27°C for 8 hrs. if in minimal media.
The recipe for minimal media used in these experiments consists of:

Ammonium sulfate                      2.5 g
Potassium phosphate (monobasic)       9.0 g
Potassium phosphate (dibasic)         6.0 g
Sodium citrate                        0.5 g
dH2O                                  970 ml
Magnesium sulfate (0.2 mg/ml)         5 ml
20% glucose                           25 ml
Gibco Trace elements stock (10x)      1 ml
Ampicillin (50 mg/ml)                 2 ml
Total                                 1 liter

Cells were then spun down and the supernatant was decanted and pipetted away. For the small-scale whole cell lysate evaluation, 500 ul of cells were centrifuged at ~15,000 x g. Pelleted cells were stored at -20°C if necessary. To continue the protocol for small-scale whole cell lysate evaluation, cells were thawed on ice and subsequently resuspended and vortexed in SDS loading dye and B-ME. 50 ul of dye/B-ME were used for every one unit of absorbance. Cells were lysed by a 100°C boil for 5 minutes and chilled on ice for one minute. Cell debris was spun down at ~15,000 x g for 5 minutes. 10 ul of supernatant was loaded onto SDS-PAGE gels. The amount of protein was quantified by comparison against lysozyme standards.

Testing solubility of protein (small-scale) via B-PER

Testing target proteins for solubility refers to determining whether the recombinant protein is found in the cytosol or in inclusion bodies of the bacteria in which it is expressed. B-PER (Pierce) is a detergent reagent that is reported to lyse cells and allow one to differentiate between soluble and insoluble proteins. The procedure is similar to the above "Testing total protein expression (small-scale) via boil/chill cell lysis" procedure until cells have been centrifuged and frozen (if necessary). Then, cells from 1.5 ml of culture were resuspended in 300 ul of B-PER and vortexed until the mixture was homogeneous. The mixture was centrifuged at 13,000 rpm for 5 minutes to separate insoluble (pellet) and soluble (supernatant) proteins. The inclusion bodies (pellet) may be resuspended in 1 ml of 1:10 B-PER. Samples were analyzed via SDS-PAGE.
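The loading normalization used in the boil/chill procedure above (50 ul of dye/B-ME per absorbance unit) keeps the amount of cell material per lane roughly constant across cultures of different density. The sketch below assumes that the "absorbance unit" refers to the OD600 of the culture at harvest, which the text does not state explicitly; the function name and example readings are illustrative.

```python
def resuspension_volume_ul(od600, ul_per_od_unit=50.0):
    """Volume of SDS loading dye/B-ME used to resuspend the cell pellet.

    Assumption: the absorbance unit in the protocol is the culture OD600
    at harvest, so denser cultures are resuspended in proportionally
    more dye and each loaded lane represents a similar amount of cells.
    """
    if od600 <= 0:
        raise ValueError("OD600 must be positive")
    return ul_per_od_unit * od600


if __name__ == "__main__":
    # Hypothetical harvest readings for two cultures.
    for od in (0.8, 1.6):
        print(f"OD600 {od}: resuspend in {resuspension_volume_ul(od)} ul")
```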
Small-scale testing for solubility -- cell lysis via sonication

Sonication is another method by which cells may be lysed, and its results seem to be more indicative of solubility than the B-PER tests. However, to sonicate by directly placing the sonicator tip in the cell culture, the culture should be at least 1 ml in volume (for the sonicator microtip of the Stock lab). Thus the following procedure was used. Cells were grown as indicated in the "Testing total protein expression (small-scale) via boil/chill cell lysis" section. The 4-ml overnight culture was added in a 1:10 dilution to 23 ml of fresh media in a 250-ml flask. After inducing with IPTG and incubating cells as described above, cells were spun down. The pellet was resuspended in 1 ml of a native lysis buffer (50 mM NaPi, 300 mM NaCl, 10 mM imidazole pH 8.0). The Stock sonicator with microtip was used to lyse the cells (setting 4, 2 minutes on, in 20 seconds on / 20 seconds off cycles) in a 2.0 ml eppendorf tube sitting in an ice-water mixture mixed with a stir bar. The cell lysate was spun down (insoluble phase), and 10 ul of the supernatant (soluble phase) was analyzed by SDS-PAGE.

Large-scale protein expression and Ni+2-affinity chromatography purification -- denaturing conditions (COG 316 N-term H6-tagged)

When it was observed that the 316 HSap2-14d-NtermH6 protein construct was expressed in high yield and found in inclusion bodies, a scheme was devised to purify this protein in denaturing conditions from one liter of 15N-enriched culture. The following protocol was developed for protein expression and inclusion body isolation.
1) 4 ml of transformed BL21(DE3)pLysS cells were grown overnight in LB+amp. This was added to 100 ml of 15N-enriched minimal media + amp and grown overnight at 37°C.
2) The 100 ml of culture was added to 900 ml of 15N-enriched minimal media + amp. This was incubated at 37°C until OD600 reached 0.692.
Expression was then induced with 2 mM IPTG and allowed to incubate at 37°C for 5 hours (OD600 = 1.6).
3) Cells were centrifuged at 5000 x g for 30 minutes. Supernatant was discarded and the cell pellet was stored at -20°C. Protein from half of the cells (i.e. 450 ml of culture) was purified in the following steps.
4) Cells were resuspended in 45 ml of native lysis buffer* and sonicated (setting 4, 10 minutes on, 30 seconds on/off intervals).
5) Lysate was centrifuged at 10,000 x g for 30 minutes at 4°C. Supernatant was decanted.
6) Pellet was resuspended in 45 ml of 6M GuHCl pH 8.0 and sonicated as described in step 4.
7) The solution was centrifuged at 15,000 x g for 30 minutes at 4°C. Supernatant (solubilized inclusion bodies) was saved and stored at 4°C.
8) Solutions and intermediates were analyzed via SDS-PAGE.
* Native lysis buffer is composed of 50 mM NaPi, 300 mM NaCl, 10 mM imidazole pH 8.0.

To exploit the hexa-histidine fusion tag at the N-terminus of the protein, the following protocol (from Rong Xiao) was used to purify the protein by Ni+2-affinity chromatography.
1) 1 ml of Ni+2-NTA beads was added to the solution from step 7 presumed to contain the protein isolated from inclusion bodies. The solution was incubated overnight at 4°C on a LazySusan.
2) A 20 ml BIORAD column was washed 2x with dH2O and equilibrated with one volume of 6M GuHCl pH 8.0.
3) The protein and Ni+2-NTA bead solution was added to the column. It was washed 3x with 6M GuHCl pH 8.0.
4) Protein was washed with 5 ml of 6M GuHCl, 100 mM NaPi, 10 mM Tris-HCl pH 8.0.
5) Protein was washed with 5 ml of 6M GuHCl, pH 6.3, 0.1 M Na2HPO4, 0.1M NaH2PO4.
6) Protein was washed with 5 ml of 6M GuHCl, pH 5.9, 0.1 M Na2HPO4, 0.1M NaH2PO4.
7) Protein was eluted with 10 ml of 6M GuHCl, 50 mM NaAc pH 4.2. Fractions were collected in 1.5 ml eppendorf tubes.
8) Intermediates and elution fractions were analyzed via SDS-PAGE.
Large-scale protein expression and Ni+2-affinity chromatography purification -- native conditions (COG 316 HSap2 C-termH6-tagged & COG 0229S Cele N-term H6-tagged proteins)

When it was observed that the 316 HSap2-23d-CtermH6 protein construct was expressed in high yield and was soluble, a scheme was devised to purify this protein in native conditions from one liter of 15N-enriched culture. The following protocol was used for protein expression in one liter of culture and Ni+2-affinity chromatography purification of protein (Rong Xiao).
1) BL21(DE3)pLysS cells were transformed with the HSap2-23d-CtermH6 expression vector. A colony was picked and incubated at 37°C overnight in 4 ml of LB+amp.
2) 4 ml of overnight culture was added to 100 ml of 15N-enriched minimal media + amp and grown overnight at 37°C.
3) The 100 ml of culture was added to 900 ml of 15N-enriched minimal media + amp. This was incubated at 37°C until OD600 reached 0.819. Expression was then induced with 1 mM IPTG and allowed to incubate at 27°C for 8 hours (OD600 = 1.676).
4) Cells were centrifuged at 5000 x g for 30 minutes. Supernatant was discarded and the cell pellet was stored at -20°C.
5) Cells were resuspended in 50 ml of native lysis buffer and sonicated (setting 4, 10 minutes on, 30 seconds on/off intervals).
6) Lysate was centrifuged at 15,000 x g for 60 minutes at 4°C. Supernatant was collected and membrane filtered (0.45 um Corning).
7) 2.5 ml of Ni+2-NTA beads were added to a 20 ml BIORAD column. A supplied disc was placed above the beads to keep them level and retain moisture. The column was washed twice with water and twice with native lysis buffer. Protein solution was loaded onto the column.
8) The protein was washed with native lysis buffer until the OD280 of the final 1 ml was less than 0.05 (30 ml total).
9) The protein was washed with wash buffer until the OD280 of the final 1 ml was less than 0.01 (20 ml).
10) The protein was eluted with 16 ml of elution buffer and collected as fractions.
11) The initial flowthru from this column was run through several fresh Ni+2-NTA columns because several additional milligrams of recombinant protein could be acquired this way.
12) Intermediates and elution fractions were analyzed via SDS-PAGE.
Native lysis buffer is 50 mM NaPi, 300 mM NaCl, 10 mM imidazole pH 8.0. Wash buffer is 50 mM NaPi, 300 mM NaCl, 20 mM imidazole pH 8.0. Elution buffer is 50 mM NaPi, 300 mM NaCl, 250 mM imidazole pH 8.0.

This same protocol -- except for purification of column flowthru -- was also used to purify the 229 protein because it was presumed to be in the cytosolic phase.

Buffer exchange

To exchange sample buffer, one of three dialysis options was used. In addition, use of a BIO-RAD desalting microcolumn allowed for rapid desalting. Gel filtration, which is described in protein purification methods, was also used as a buffer exchange method because it dilutes the sample buffer.

Dialysis

Testing for folding (denaturing to native conditions) – microdialysis buttons

Folding is assumed to occur during a slow exchange of buffer from denaturing to native conditions. Small-scale dialysis was used to test the refolding of the 316 HSap2-14d-NtermH6 tag protein. A protein sample in 6M GuHCl, 5 mM DTT pH ~7.5 was first heated at 52°C for 10 minutes to disrupt any bonding. Clean microdialysis buttons were loaded with 5 ul of protein. A wooden-peg "plunger" was used to seal a pre-soaked dialysis membrane (Spectrum Spectra/Por, Molecular Weight Cutoff 3,500) over the opening of the button. A black, plastic O-ring was used to complete the seal. The procedure was repeated if air bubbles were observed upon sealing the membrane. Care was taken not to touch the dialysis membrane. The button containing the protein sample was immersed in wells of several 3 ml buffers varying in pH, salt concentration, and type of buffering reagent used. Dialysis was conducted at 4°C for several hours.
After 21 hours, buttons were visualized under a light microscope (Stock lab darkroom), and observations about precipitate were recorded. Buttons were left at 4°C for 2 more days, and precipitate conditions did not change. Buttons were then left at room temperature and observed a few weeks later. Conditions again did not change.

Small scale dialysis – cassettes

When protein samples in the range of tens to hundreds of microliters needed to be dialyzed, cassettes (Slide-A-Lyzer by Pierce) were used. The cassettes allow a small amount of protein sample to be loaded via a syringe and 18-gauge needle through very small ports in the cassette. The sample resides in a membrane pouch at the center of the cartridge. The protocol for loading and unloading the sample from the cassette is found in the Pierce Slide-A-Lyzer instruction manual. Care was taken not to puncture the membrane with the needle. The loaded cassette was suspended in a beaker with dialysis buffer at a volume at least 1000x greater than the sample volume. The buffer was stirred (setting 2-4) and left at 4°C. If time permitted, dialysis buffer was replaced with new buffer at some point in the middle of dialysis.

Large-scale dialysis

When protein samples in the range of ones to tens of milliliters needed to be dialyzed, dialysis membrane tubing was used. The membrane was soaked in dH2O for 5 minutes, and the sample was poured or pipetted into the tubing. Orange dialysis tubing clips were used to seal off the tubing at both ends, and the sample was subjected to dialysis as described above in the "Small scale dialysis – cassettes" section.

Buffer-exchange using BIO-RAD Desalting Micro-columns

BIO-RAD P-6 "Micro Bio-Spin" Chromatography Columns are the size of eppendorf tubes. These columns retain molecules smaller than 6 kDa by use of a polyacrylamide gel matrix. They are pre-packed in a Tris buffer, which can be replaced by 4 cycles of wash and equilibration with the new buffer.
These columns were used to buffer exchange protein samples 20 to 75 ul in volume. See the BIO-RAD instruction manual for further details.

Concentrating protein samples

"Centri"fugal devices

The Centriplus (original volume 2 to 10 ml) and Centricon (original volume less than 2 ml) from Millipore were used to concentrate 316 and 229 protein samples. Instructions can be found in the manuals. There is no evidence of the 316 or 229 protein samples irreversibly binding to the YM membranes of the concentrating devices.

Lyophilization (for 316 C-termH6 tag protein)

The 316 C-termH6 tag protein was subjected to lyophilization. Lyophilization was tried twice, and both times it appeared from SDS-PAGE analysis that a significant fraction of the protein was lost in the procedure, perhaps as much as two thirds. It was decided that this was not a good way to concentrate the sample.

Protein purification: FPLC

Gel Filtration

Gel filtration was conducted on the 316 C-termH6 tag protein. A Pharmacia HiLoad 16/60 Superdex 75 prep grade column was used. Its useful fractionation range for proteins is reported as Mr 3,000 to 70,000 (3 to 70 kDa). It was attached to the FPLC system of the Montelione lab. Two elution buffer conditions were tested: 200 mM NaCl, 25 mM NaPi, 2 mM DTT pH 6.8 and 300 mM NaCl, 50 mM NaPi, 2 mM DTT pH 8.0. A flow rate of 2.5 ml/min was used. SDS-PAGE analysis was performed on elution fractions.

Anion Exchange Chromatography: 316 C-termH6 tag protein

A Pharmacia Mono Q column was used to purify the 316 C-termH6 tag protein. The column makes use of a quaternary ammonium strong anion exchanger. Protein was eluted from the column using a linear gradient from 0 to 1 M NaCl in 20 mM Tris-HCl, 5 mM DTT pH 7.5. A major peak was observed at 58% into the linear gradient. SDS-PAGE analysis was performed on elution fractions.

Cation Exchange Chromatography: 229 protein

A Pharmacia Mono S column was used to attempt to purify the 229 protein.
The column makes use of a methyl sulfonate strong cation exchanger. Elution from the column was attempted using a linear gradient of 1 M NaCl, 25 mM MES, 10 mM DTT pH 5.5. SDS-PAGE analysis was performed on elution fractions.

Biophysical analysis

Mass Spectrometry (mass spec)

Mass spec was performed on protein samples at the CABM mass spectrometer by Rong Xiao. 1 ul of sample was mixed with a matrix and spotted onto a plate that is inserted into the mass spec machine.

Far-UV circular dichroism (CD) spectroscopy: 229 protein

CD was performed on the 229 protein by Dr. Norma Greenfield of the RWJMS CD lab.

2D-NMR -- HSQC and TROSY: 229 protein

HSQC and TROSY experiments were run on the 229 protein by Dr. Parag Sahasrabudhe using the Montelione lab 500 MHz spectrometer. Spectra were referenced.

Results & Discussion

Note: Most methods used in this section have been described, sometimes in great detail, in the Biochemistry portion of Experimental Methods. Please consult that section if you have questions about methods. Data and reports regarding the computational biology work may be found at our COG Analysis and Expansion Website: www-nmr.cabm.rutgers.edu/~bioinformatics/cogs

COG 0011S (computational biology)

An expansion of the COG resulted in the growth of the domain family from 3 sequences to 9 sequences. The area of conservation spans about 97 residues, which agrees with the lengths of many of the full-length proteins. This allows one to assume that all of the proteins in the domain family are single domain proteins. Since no Metazoan proteins were found to contain this domain, work was discontinued after establishing the expanded domain family.

COG 0229S: C. elegans gene product (CEle1) – N-term hexa-histidine construct

Alexandra Gardino of the lab had already done the computational biology and molecular biology work necessary for protein expression. This expanded domain family is conserved in at least 15 gene products.
It has representation among eukaryotic and prokaryotic organisms. The region of conservation spans about 150 residues, and most of the full-length proteins in this family appear to be single domain sequences. Charles Lu of the lab had transformed expression vectors into various E. coli cell lines and tested for expression and solubility. It was decided that the C. elegans gene product of the domain family would be expressed for each of the three hexa-histidine tag loci options. Lu had also shown through small-scale testing that significant amounts of the protein were found in the cytosolic phase of the cell lysate. Lu previously purified the C-terminal hexa-histidine tag protein via Ni+2-affinity chromatography and FPLC. His CD data showed this protein to be about 50% random coil, and his HSQC data was hard to interpret either because the pH of the sample was too high or because the sample was very disordered.

Cell lysis and Ni+2-affinity chromatography

I resuspended a cell pellet that Charles had centrifuged from one liter of E. coli culture induced to express the 15N-enriched N-terminal hexa-histidine tag protein. This pellet had been stored at -20°C for a few months. The cells were lysed and the protein was purified as described in the "large scale protein expression and Ni+2-affinity chromatography -- native conditions" section of Experimental Methods. Purification intermediates were analyzed via SDS-PAGE and compared to lysozyme standards on the same gels. The following yield of recombinant protein was projected from 1 liter of culture (Figure 5a):

Total yield:                              65 mg
Soluble yield:                            21 mg
Protein eluted from Ni+2-NTA column:      7.8 mg (in 12 ml)

After this eluted protein was stored at 4°C for 7 days, it was realized that precipitate had formed. DTT was added to 10 mM to this sample.

Dialysis

It was necessary to decrease the sample pH so that cation exchange chromatography could be used to purify the sample. The protein was originally in a pH 8.0 buffer, and its pI is 7.70.
500 ul of the sample was buffer exchanged (Pierce cassette) into a 25 mM MES, 10 mM DTT pH 5.5 buffer. After dialyzing for 24 hrs. at 4°C, very little precipitate was visible. The rest of the sample was dialyzed (dialysis tubing) into the same buffer. After 10 hrs. of dialysis, much precipitate was present in the dialysis tubing. The sample's pH was 6.22; crossing over the pI probably caused this precipitation. This sample was analyzed via SDS-PAGE, and the yield was estimated to be 3 mg. The old dialysis buffer was exchanged for fresh buffer, and dialysis was allowed to continue for 14 hrs. more. No additional precipitate was evident, which gives further reason to believe that the sample passing through the pI of 7.7 caused the precipitation. 9.5 ml of sample, containing about 3 mg of protein, was recovered at pH 5.45. The absorbance of the sample at 280 nm was measured; using the relationship that 1 A(280) unit = 0.84 mg/ml, the yield of this sample was calculated to be 3.6 mg. This is close in value to the 3 mg calculated from SDS-PAGE analysis. SDS-PAGE of the sample after the dialysis step for ion exchange chromatography shows a decrease in high molecular weight proteins and an increase in low molecular weight proteins after dialysis (Figure 5). Perhaps some degradation occurred during the 40 days in which I was working on this sample. Mass spec taken on the 40th day after the Ni+2-affinity purification shows that the monomer species is predominant and that there are some lower molecular weight species (Figure 6). This mass spec agrees with the previous SDS-PAGE analysis.

Cation exchange chromatography

Two cation exchange chromatography experiments, the first with 2 ml and the second with 0.5 ml of the sample loaded, were conducted using the MonoS column and FPLC system. A 1M NaCl, 25 mM MES, 10 mM DTT pH 5.5 linear gradient was used to try to elute the protein. The cation exchange chromatograms showed a very large peak before the salt gradient even increased from 0%.
The A(280) values for these peaks were extraordinarily high; the first run had a peak with A(280) greater than 1.5 and the second run had a peak of 0.45. When these and other elution fractions were visualized via SDS-PAGE, no protein was evident. The four elution fractions corresponding to the peak of the second experiment were pooled together, concentrated (Centricon) from 3.5 ml to 0.60 ml, and analyzed via SDS-PAGE. No protein was evident in these fractions. To try to determine the status of the protein during ion exchange experiments, a third experiment was conducted in which 1 ml of the sample was loaded onto the column. A buffer matching that of the sample was run through the column to try to flush the protein off the column; no protein was evident via chromatogram, SDS-PAGE, or mass spec. We deduced that the protein was bound to the column. A 1M NaCl, 25 mM MES, 10 mM DTT pH 5.5 solution was applied at 100% to the column to try to elute the protein. No protein was visible (confirmed by mass spec and SDS-PAGE), and the large peak seen in the first 2 experiments was not visible. Next, a 2 M NaCl pH 6.34 solution (100%) was used to try to elute the protein, and again nothing was evident. Finally, a 1M NaCl, 50 mM HEPES pH 8.0 solution was used to try to elute the protein from the column. This solution has a pH greater than the pI; thus, the protein overall and the column are now both negatively charged and should repel each other. Again, no eluted protein was evident (chromatogram and SDS-PAGE). The most logical explanation for why the protein does not elute from the column is that it has irreversibly bound to it. Thus, purification of this protein on a Mono S column is not an efficient step.

Biophysical Characterization of 229 CEle-NtermH6 protein

Because only a small amount of the sample remained after our cation exchange chromatography trials, we thought the best course of action was to analyze the remaining sample using mass spec, CD, and HSQC experiments.
CD & HSQC with TROSY

The sample was adjusted to pH 4.64 and concentrated down from 4.5 ml to 0.225 ml (Figure 5b). The 2D-NMR experiment HSQC with TROSY (referenced) contains some dispersed peaks, which is an indication that regions of the protein are folded (Figure 7). However, the "blob" of peaks in the center of the spectrum indicates that many nuclei experience similar chemical environments, which results in peaks with similar frequencies. This is an indication that portions of the protein are unstructured. It may be that these peaks are partially due to the smaller molecular weight, contaminating proteins. However, CD on the same sample (15 ul of sample diluted in 285 ul of 10 mM KH2PO4, 2 mM DTT pH 5.65) showed a similar result: half of the protein appeared to be random coil. These data were similar to those acquired by Lu on the C-termH6 tag construct (Table 4).

COG 0316S: H. sapiens gene product 2 (HSap2)

Computational biology

An expansion of COG 316 resulted in the growth of the domain family from 9 to 39 protein sequences (Table 5), representing 16 prokaryotes and 7 eukaryotes. The domain is approximately 104 residues in length, and most of the proteins appear to be single domain proteins. The domain sequence is characterized by striking overall sequence conservation, with several interesting sequence features. For example, the domain contains 3 conserved cysteines located next to conserved glycines. The C-terminal region also contains a motif indexed in Prosite but of unknown function. A phylogenetic tree was made from the multiple sequence alignment (msa), which allowed us to choose subgroups of the domain to work on (Figure 8). Ultimately, the choice was made to work on the HSap2 domain sequence.

Biochemistry

One note about the 316 HSap2 domain sequence: there is only one residue that contributes significantly to the extinction coefficient at A(280). That residue is a tyrosine, which results in a very low molar extinction coefficient of ~1760.
This makes quantification of protein via A(280) difficult and unreliable. Comparison of SDS-PAGE results and the A(280) readings confirms this. Thus, protein was always quantified by analysis via SDS-PAGE.

Expression Testing and 316 HSap2-14-Nterm-hexahis construct

Eight of the nine HSap2 expression vectors for COG 0316S (Table 3) were tested for levels of total protein expression at 37°C as described in "testing total protein expression (small-scale) via boil/chill lysis" of Experimental Methods. Extrapolations were made for how much total protein would be expressed in 1 liter of minimal media and 1 liter of LB (Figure 9). These data indicate that for this domain, the expression vectors with the T7 promoter and those in the pLysS cells seem to express the best. In addition, the pET 14 construct generally expresses in high quantity. For these reasons, the pET 14 construct in pLysS cells was chosen for large scale expression. Because small-scale solubility testing with B-PER indicated that all 316 domain sequences were in the insoluble phase (inclusion bodies), 500 ml of culture containing 15N-enriched HSap2-14d-Nterm-hexahis was lysed and purified in denaturing conditions (see "large-scale protein expression and Ni+2-affinity chromatography purification – denaturing conditions" in Experimental Methods). This resulted in a pure yield of about 10 mg of recombinant protein isolated from Ni+2-affinity chromatography. The pH of the sample was raised to ~7.5 and DTT was added to 5 mM. The sample was stored at 4°C for about 25 days. Attempts at folding were then made via microdialysis (see Experimental Methods: microdialysis). In each condition tested, some precipitate was found after dialysis (Table 6). A trend seemed to be that the closer the dialysis buffer pH was to the pI (5.76) of the protein, the smaller the amount of precipitate.

Solubility testing via sonication

During the refolding experiments, Dr.
Kristin Gunsalus advised that the B-PER reagent had given her false positives regarding insolubility of protein for some of her samples. It was also brought to my attention that reducing the rate of transcription/expression by lowering the incubation temperature post-induction had the effect of increasing the proportion of protein in the soluble phase. High-expressing vector and cell line combinations were again tested for solubility and expression, but now at 27 and 37°C and after lysis via sonication (Table 7). B-PER was also tested, and it showed no soluble protein after the 37°C post-induction period. The 23d-CtermH6 tag construct showed promising results.

Large scale expression and purification in native conditions for 15N-enriched HSap2-23d-CtermH6 construct
The 316 HSap2-23d-Cterm-hexahis protein was expressed in one liter of 15N-enriched minimal media. Expression, cell lysis, and Ni+2-affinity purification are described in Experimental Methods. SDS-PAGE analysis shows the amount of recombinant protein in the following steps to be (Figure 10):
sonication: > 72 mg
soluble sup: 60 mg
after filtering: 60 mg
loss due to flow-thru: -30 mg
recovered from elution: 9 mg
It was noticed that a large amount of the target protein remained in the flowthru that passed through the column after loading the crude cell lysate. Passing this fraction through the same column again did not improve recovery. When the flowthru was passed through columns of fresh Ni+2-NTA, an additional 7-9 mg of target protein could be recovered. This was done using 3 columns in series with about 2 to 5 ml of Ni+2-NTA beads per column. A trend was also observed that the greater the volume of fresh beads used, the more protein was recovered. This indicates that this protein easily saturates the Ni+2-NTA beads. If more Ni+2-NTA columns were used, more target protein would be recovered.
However, as the volume of the beads in a column increases, the hydrostatic pressure of the solution above the beads and the flow rate both decrease. After running this sample through 4 columns of Ni+2-NTA beads, the total target protein recovered increased to ~16 mg.

Gel filtration studies
Gel filtration showed that the majority of the protein eluted in the void volume of the column. The large peak at this volume corresponds to globular structures greater than 70 kDa (Figure 11). A small peak was also observed at a time corresponding approximately to the dimeric weight of the protein. When the fractions corresponding to this peak were concentrated and analyzed via SDS-PAGE, the monomer appeared faintly on the gel. In addition, a band corresponding to a dimer also appeared on the gel. This band actually appears in every lane of every gel on which the 316 protein was run. The protein may be in equilibrium as several different molecular species. Perhaps the protein is present in a complex with molecular weight greater than 70 kDa, with this association resistant to DTT. Subjecting the protein to the denaturing and reducing conditions of SDS-PAGE with B-ME may dissociate the protein predominantly into a monomer. However, there may be a covalent association between a small portion of the monomer units to form a homodimer. This may explain the small peak in the gel filtration profile and the supposed dimeric band in the SDS-PAGE analysis. The protein may be in equilibrium as a 70+ kDa species, a dimer, and a monomer. On the other hand, this may just be evidence of aggregation and partial dissociation of the aggregate under certain conditions.

Ion exchange chromatography
Because the gel filtration experiment was not successful in purification and because the Ni+2-affinity chromatography leaves contaminating proteins in the sample, anion exchange chromatography was implemented.
Buffer-exchanging the 47.5 ml of sample was first necessary to bring the protein to suitable conditions (i.e. low salt) for a Mono Q column. The 50 mM NaPi, 300 mM NaCl, 250 mM imidazole pH 8.0 buffer was exchanged via dialysis for a 20 mM Tris, 5 mM DTT pH 7.5 buffer. No precipitate was evident. The entire sample was then loaded onto the Mono Q column. Flowthru was collected, and it was discovered via SDS-PAGE that a small amount of the protein did not bind the column; the column was probably saturated. The protein that was bound to the Mono Q column was eluted using a 1M NaCl linear gradient (from 0 to 100%). A large, somewhat broad peak was observed at 58% into the gradient (Figure 12). Fractions 9 to 20 (~12 ml total) of the eluate were found to contain the protein via SDS-PAGE analysis. These fractions were pooled together and analyzed via SDS-PAGE (Figure 13b). Because of the appearance of the peak, it was incorrectly assumed that most of the protein had eluted from the column. However, when the amount of target protein in the ion-exchange purified sample was estimated via SDS-PAGE, only ~2 mg of protein was seen. This represents an ~80% loss. Fractions 24 to 33 from the ion exchange profile were concentrated from 10 ml to 0.325 ml and were analyzed via SDS-PAGE. It was estimated that these fractions collectively contain only 20 ug of the protein. If most of the protein is not present in the elution fractions or in the flowthru, then it must still be bound to the column. It may be necessary to work with buffers that compete more strongly with the protein for the column’s charged groups (quaternary ammonium groups on a Mono Q anion exchanger). Another possibility is that the protein may have irreversibly bound to the column’s resin. Alternatively, the protein may have degraded over time. Mass spec of the ion exchange purified sample shows the largest peak to be at m/z = 3720 and no peak to be present at the molecular weight of the domain sequence.
However, SDS-PAGE of this sample shows the protein to be largely present as the monomer (Figure 13). In addition, comparing the sample purified by Ni+2-affinity chromatography alone to the sample purified by Ni+2-affinity chromatography and anion exchange shows the latter to be slightly purer (Figure 13).

Conclusion & future work
We have shown that nickel-affinity chromatography and ion exchange chromatography (the latter with considerable loss) remove contaminating proteins from our 316 sample. Gel filtration shows the domain to form a high molecular weight species, with multiple bands showing on reducing and denaturing gels. This phenomenon has been observed in the past for homomultimers that are covalently bound and experience an equilibrium between different molecular weight species. CD will be performed on the purified sample of this COG. Sedimentation equilibrium may allow us to ascertain the molecular weight of the species in solution. A 30 kDa cut-off filter will be used on the sample to see if the lower molecular weight species can be separated. In addition, the protein sample may be produced on a larger scale than before, subjected to the high-yield recovery and purification steps proven effective, and combinatorially tested for crystallization suitability. Elsewhere, some limited insight has been gained with respect to COG 0316S. RNAi experiments (Fabio Piano, Cornell Univ.) done using the C. elegans cDNA as template result in an early larval lethal phenotype with complete penetrance. This result was shown in three repeated trials. In addition, some of the genes coding for ORFs for this domain come from operons that are active during nitrogen fixation in some bacteria. Moreover, many neighboring genes in these operons are implicated in iron-sulfur cluster formation or transportation. The 316 domain may be a protein that is associated with iron-sulfur clusters through its conserved cysteines, and it may act as a metabolic pathway member.
These conserved cysteines also give reason to analyze the protein in oxidizing conditions. We have shown mono-S column chromatography to be ineffective for purifying domain 229. Gel filtration will be used to purify this domain sequence. Exopeptidases will be applied to see if limited proteolysis will cleave off the region of the domain sequence possibly responsible for disorder. CD and NMR will follow.

References:
Altschul, Stephen F., et al. NAR. 25:3389-3402. 1997.
Bairoch, A., Apweiler, R. NAR. 28:45-48. 2000.
Chothia, C. Nature. 357:543-544. 1992.
Corpet, F., Gouzy, J., Kahn, D. NAR. 26:323-326. 1998.
Corpet, F., Gouzy, J., Kahn, D. NAR. 27:263-267.
Feng, W., et al. Biochemistry. 31:10881-96. 1998.
Gunsalus, K.C., et al., submitted. 2000.
Hofmann, K., Bucher, P., Falquet, L., Bairoch, A. NAR. 27:215-219. 1999.
Holm, L. & Sander, C. NAR. 25:231-234. 1997.
Holm, L. & Sander, C. Science. 273:595-602. 1996.
Huang, X. Genomics. 33:21-31. 1996.
Hwang, K.Y., et al. Nature Structural Biology. 7:691-696. 1999.
Levitt, M. & Gerstein, M. PNAS USA. 95:5913-5920. 1998.
Lima, C.D., Klein, M.G., & Hendrickson, W.A. Science. 278:286-290. 1997.
Montelione, G.T. & Anderson, S. Nature Structural Biology. 6:11-12. 1999.
Montelione, G.T., et al. NMR Pulse Sequences and Computational Approaches for Automated Analysis of Sequence-Specific Backbone Resonance Assignments of Proteins. Biological Magnetic Resonance, Volume 17: Structure Computation and Dynamics in Protein NMR. Eds. Krishna & Berliner. Kluwer Academic / Plenum Publishers: New York. 1999.
Patthy, L. Matrix Biology. 5:301-310; discussion 311-312. 1996.
Rost, B. Meth. in Enzym. 266:525-539. 1996.
Rost, B. & Sander, C. J. Mol. Biol. 232:584-599. 1994.
Rost, B. & Sander, C. Proteins. 19:55-77. 1993.
Schuler, et al. Science. 274:540-546. 1996.
SPARKY: http://www.cgl.ucsf.edu/home/sparky/
Tatusov, R.L., Koonin, E.V., & Lipman, D.J. Science. 278:631-637. 1997.
Terwilliger, T.C. & Berendzen, J. Genetica. 106(1-2):141-7. 1999.
Thompson, J.D., Higgins, D.G., Gibson, T.J. NAR. 22:4673-80. 1994.
Tilghman, S.M. Genome Research. 6:773-780.

Legends
Figures
Figure 1. Computational biology protocol. Steps up to "Search EST databases" were applied to COG 0011S to expand the domain family. This entire protocol was applied to COG 0316S to expand the domain family and select sequences to express.
Figure 2. Molecular biology protocol. This protocol was applied to COG 0316S. 5 coding sequences were ligated into the "holding" vector, but only 3 of these coding sequences were of suitable quality for ligation into expression vectors (see tables 2 and 3). Italicized steps are those that may be omitted in a high-throughput operation.
Figure 3. Digest and cloning scheme for formation of 316 CEle1 (a) and HSap2 (b) expression vectors. A single PCR product may be produced for ligation into 9 different expression vectors of the Gunsalus expression kit (Gunsalus, et al., 2000). To preserve the C-terminal protein sequence for the N-terminal hexa-his tag and non-fusion constructs, 2 different PCR products were synthesized (indicated by the 2 types of reverse primers for CEle1). Changes were made to the 5' and 3' regions of the coding sequence for each domain to accommodate ligation into the expression vectors. In some cases, this changed the native protein sequence at its N-terminal and/or C-terminal regions. This is indicated in the peptide sequences section of (a) and (b).
Figure 4. Sequence for 229 CEle Nterm hexa-histidine tag construct. Computational biology, molecular biology, and protein expression and solubility testing for domain family 0229S were done previously by Alexandra Gardino & Charles Lu. The sequence is 162 residues long and has a molecular weight of 18,449 g/mol, a molar extinction coefficient of approximately 21870, a pI of 7.70, and a charge of 2.85 at pH 7. Make note of the 8 cysteines. Computational biology work for this COG may be found at the CABM NMR Lab COG Analysis & Expansion Website.
Figure 5. Purification of 229 CEle Nterm hexa-histidine tag construct. (a) shows protein purified using Ni+2-affinity chromatography under native conditions. Yield is approximately 8 mg of target protein. 1. total cell lysate sonicated 2. soluble protein 3. column flowthru 4. washes (b) shows protein after Ni+2-affinity chromatography and dialysis to pH 5.45 buffer. During this pH reduction from pH 8.0, some of the protein precipitated, reducing yield to approximately 3 mg of target protein. It also appears that some of the higher molecular weight protein has decreased in quantity while lower molecular weight protein has increased in quantity (compare a and b). This may indicate degradation of higher molecular weight contaminants.
Figure 6. Mass spectrometry of 229 CEle Nterm hexa-histidine tag protein that had undergone nickel affinity purification plus dialysis across the pI (figure 5b). The major peak corresponds to the protein’s molecular weight. A smaller, higher molecular weight peak indicates that a portion of the protein forms a dimer under mass spectrometry conditions. Some lower molecular weight contaminants remain in the sample after Ni+2-affinity chromatography and dialysis across the pI of the protein. These data are consistent with figure 5b.
Figure 7. HSQC with TROSY of 229S CEle Nterm hexa-histidine tag protein (pH 4.64).
Figure 8. Unrooted phylogenetic tree of domain family 316. The tree graphically represents the similarity between sequences. Lengths of branches correspond inversely to the degree of similarity between sequences on the branches. Branching is an indication of a supposed evolutionary split between similar sequences. Those sequences that we chose to clone are circled. Note that they are in different groups.
Figure 9. COG 0316S HSap2 small scale expression testing. LB and MJ9 (minimal) refer to media types. DE3 and pLysS refer to expression cell lines of the BL21 series available from Novagen. Testing was conducted at 37°C.
Whole-cell lysate was analyzed via SDS-PAGE. Vector construct notation is explained in figure 3.
Figure 10. 316 HSap2-Cterm-hexahis tag protein purified via Ni+2-affinity chromatography. Gel 1 shows the status of the protein during this purification. Gel labeling refers to:
1 Total cell lysate (via sonication) after expression
2 Supernatant (soluble proteins) of total cell lysate
4 #2 filtered using sterile 0.22 um filter
5 Flow-thru from nickel column affinity purification.
WASH … Washing of nickel column (bound with column) with increasing concentrations of imidazole solutions (300 mM NaCl, 50 mM NaPi, pH 8.0; imidazole increased from 10 to 20 to 250 mM; 250 mM is the elution concentration).
5. 4 ug lysozyme (for standard)
Gel 2 shows the elution fractions from this purification. Gel 3 shows a nickel-affinity column (fresh nickel) purification of the flow-thru (gel 1, lane 5) of the previous column run. Gel 4 shows that a significant amount of the protein is still present in the flow-thru. This flow-thru was run 4 more times through nickel columns to recover a total of about 12 mg protein.
Figure 11. Gel filtration of 316 HSap2-Cterm-hexahis tag protein. (a) refers to the chromatogram of the experiment. SDS-PAGE analysis showed most of the protein to be present in the void volume (40 ml elution point). This corresponds to Mr > 70 kDa. A very small amount of the protein elutes at the 62 ml point. This roughly corresponds to the protein’s dimeric molecular weight. (b) represents a calibration curve for the column.
Figure 12. Anion exchange chromatography purification of 316 HSap2-Cterm-hexahis tag protein. (a) refers to a chromatogram from purification on a Pharmacia Mono Q column. A linear gradient of 1M NaCl buffer was used for elution. Elution buffer (pH 7.5) also contained 20 mM Tris, 5 mM DTT. Protein predominantly elutes from the column 58% into the 1M NaCl gradient. (b) refers to SDS-PAGE analysis of elution fractions corresponding to figure 12a.
Figure 13b shows yield after this purification step. Figure 13. COG 0316 HSap2 Cterm hexahis after Ni+2-affinity chromatography and anion exchange chromatography purification. (a) refers to protein just after Ni+2-affinity chromatography. (b) refers to protein concentrated after Ni+2-affinity chromatography and anion exchange chromatography purification. Yield was reduced by ~80% during the anion exchange chromatography experiment. However, higher molecular weight contaminants were removed during this step. Tables Table 1. DNA acquisition for 316 domain. Genomic DNA was obtained from neighboring labs. cDNA was ordered from Genome Systems of the IMAGE Consortium. The yeast gene can be obtained from a cosmid in which it has been cloned. However, we were able to conveniently use genomic DNA from a neighboring lab. The shading indicates that we did not use the cosmid. Table 2. Partial listing of PCR errors generated during cloning of coding sequences for domain 316. Sequence errors for all PCR products were found in the region coded by primers. No pattern of errors is evident from analysis of primer sequence and PCR product sequence. In addition, the same primer batch gave different patterns of errors for separate PCR reactions. The errors limited the number of expression vectors that were constructed (table 3). Table 3. Expression vector status for 316 domain. In most cases, 2 types of PCR products were made. One product (C type) was made in which the DNA encoding the C-term portion of the domain was modified to accommodate a hexa-histidine tag. Another product (N type) was made in which native sequence at this C-term encoding region was left intact. Table 4. Circular dichroism data for 229 CEle-Nterm-hexahis protein and CEle-Cterm-hexahis protein (Lu data). These data consistently show that the 229 hexahis fusion protein is approximately 50% random, 15% alpha helix, 32% beta strand, and 4% turn. Table 5. Members of the expanded 0316 domain family. 
The 316 domain family was expanded to include 39 sequences – 16 of the sources being prokaryotic and 7 being eukaryotic. Red indicates original COG member. Black/green indicates those added to the expanded COG 0316S family. Green indicates sequence currently being investigated by us. ID refers to label given by us to each protein sequence. Swiss-Prot ID's begin with "P", "Q", or "Z". (P) indicates prokaryote and (E) indicates eukaryote. EST appearing under protein length signifies that the entire sequence of the protein is not known. Table 6. Folding of 316 HSap2-Nterm-hexahis: microdialysis from denaturing to native conditions. Original protein buffer was 6 M GuHCl pH ~7.9 and all solutions contained 5 mM DTT. Protein was heated at 52°C for 10 minutes before dialyzing. Concentration of protein was ~250 µg/ml. Observations for precipitate were made under light microscope after 10 hrs. of dialysis at 4°C. Samples were again checked 12 hrs. after initial observations; no changes occurred. Samples were then allowed to sit at room temperature for 2 days; no changes occurred. Table 7. Small-scale solubility testing of expression of COG 0316 Hsap2 variants using sonication for cell lysis. Testing was performed in minimal media. Whole cell lysate (total protein) and supernatant (cytosolic protein) was analyzed via SDS-PAGE. Target protein levels were projected for one liter of bacterial culture. Each line contains mg of soluble protein divided by mg of total protein for a temperature/cell line condition. "Inc" stands for inconclusive, meaning solubility of protein could not be judged due to unsuccessful sonication. Cells went through post-induction period of 5 hours at 37C or hours at 27C. These promoter / tag location / cell line conditions were also tested for 27C / 8 hr. post-induction period using the detergent B-PER to recover protein. These tests for all 5 different conditions indicated that protein was insoluble. 
Promoter / tag location / cell line combination chosen for large-scale expression and purification under native conditions is boldfaced.

Choose COG (small, unknown function).
Create an msa of sequences.
Decide upon ends of domain (regions of conservation and start/stop codons).
Write a consensus sequence from alignment (degeneracy allowed).
Check PDB for possible homologues (BLAST).
Check for transmembrane features (PredictProtein).
Use HMMER to search nr db for possible homologues.
Determine if sequences should be added to expanded COG (E scores, region of alignment, phylogenetic tree, msa).
Check PRODOM & Prosite for any possible similar domains and motifs.
Check Swiss-Prot for any functional information about sequences.
Search EST databases for possible homologues (Unigene).
If similar EST's are found, create a consensus among the overlapping EST's (Unigene, CAP, 6-frame translation).
Add to msa/tree and decide if new sequences are possible homologues.
Finalize domain ends (previous, 2° structure prediction).
Select sequences to express (tree, availability of DNA/cDNA, intron interrupting ORF, common restriction sites in DNA, ease in primer design).
Figure 1

Determine sequence of DNA that will be "PCR-ed out" (sequencing, if necessary)
Strategically design primers (restriction sites)
PCR out coding sequence from genomic or complementary DNA
Run insert on and purify from gel
Adjust insert to appropriate concentration
Ligate into a holding/amplification plasmid
Transform into amplification / holding cells, plate, and incubate
Pick multiple colonies according to blue (negative) / white (positive) selection
Verify ligation of insert using restriction enzyme digests and/or colony PCR
Sequence insert and verify for accuracy according to database and primer sequence
Amplify insert in holding vector in large bacterial culture, and purify plasmid
Digest insert from vector strategically for ligation into expression vectors
Run insert on and purify from gel
Concentrate insert
Prepare expression vectors (amplify, digest strategically, purify, and concentrate)
Ligate insert into expression vectors
Check to see if insert is present
Figure 2

(a) (b)
Figure 3

MGHHHHHHSHMTTKKFRMEDVGLSKLKVEKNPKDVKQTE
WKSVLPNEVYRVARESGTETPHTGGFNDHFEKGRYVCLCCG
SELFNSDAKFWAGCGWPAFSESVGQDANIVRIVDRSHGMHR
TEVRCKTCDAHLGHVFNDGPKETTGERYCINSVCMAFEKKD
Figure 4

Figure 5 (gel images; lane labels: 1 2 3 4, Elution Fractions)

Figure 6 is a mass spectrometry result file of COG 229 Cele1-NtermH6 tag-14 protein. It is currently not on the web. It is available from the author upon request.

Figure 8 is a phylogenetic tree of the COG 316 domain family. It is viewable on the web at the CABM NMR Lab COG site, which is accessible from www-nmr.cabm.rutgers.edu

Figure 10 (gel images, Gels 1 to 4; panel labels: lanes 1 2 3 4 6 5, WASH, ELUTION FRACTIONS, FLOW THRU FROM NICKEL AFFINITY COLUMN, 2 ug lysozyme, 4 ug lysozyme)

Figure 11 (a) (b)

Figure 12 (a) (b) (gel lane labels: 9 11 17 18 19 22)

Figure 13 (a) Purified by Ni+2-affinity chromatography. (b) Purified by Ni+2-affinity and anion exchange chromatography (concentrated).

My Code for Domain Bsub1 Species Strain Accession Number for Protein/DNA target Source / what was given Bacillus subtilis 168 Protein = 2635713 (ncbi) Genome = Z99120 (ncbi) S. Anderson via Rehan Azia / genomic DNA Hinf1 Haemophilus influenzae RD/KW20 Scer2 Saccharomyces cerevisiae S288C/AB97 2 Protein = P45344 (SP); 1175501 (ncbi) Genome = 1574575 (ncbi) Protein = Z12425 (SP); 2495215 (ncbi) Genome = Z71255 Chromosome cosmid = 805025 (ncbi); Z49219 (ncbi) C41876 C31399 EST = AA727377 (ncbi) S. Anderson via Rehan Aziz / genomic DNA Steve Brill / genomic DNA Cele1 Mmus1 Caenorhabditis elegans Mus musculus CB1489 him8(e1489) C57BL/6 Hsap2 Homo sapiens From brain; anaplastic oligodendend rogliome tissue Table 1 EST = AI202743 (ncbi) XX Yuji Kohara (Japan) / EST WashU-HHMI Mouse EST Project; contact Marra M / Mouse Project via IMAGe (EST) Contact: Robert Strausberg, Ph.D.
(NCI-CGAD) via IMAGE (EST) CDNA# (necessary for ordering) Reported Length (if applicable) Other notes ORF we need is complement of 107656..108018; We have strain 1A243 ORF we need is 2450..2794 of U32845 ORF we need is 151843..152400 Cosmid 9299 Yk282c1 IMAGE: 1209939 IMAGE: 1859411 ORF we need is 30083..30640 Insert 360 bp 447 bp 450 bp Part of Unigene Mm.7884 Part of Unigene Hs.63913 PCR Product BSub1-1 BSub1-2 CEle1-1 CEle1-2 HInf1-1 Table 2 Expected sequence CGA AGC TTA TTA GGT T GC Observed sequence CGA ATC GTA GTA CTT G CG HInf1-1 GGGATCCTG TCTGATCCGG G Ax4 Cx4 HInf1-1 TCC TAA HSap2 GCC GCT SCer2-1 T G Comments Reverse primer. Errors cause loss of restriction site and a stop codon. Error causes lys pro. Forward primer Error is at 5' end of forward primer. Upstream of first restriction site; will not cause a problem No errors Forward primer. Several errors cause loss of BamHI site. Reverse primer. Substitution error occurred at least 4 times. Reverse primer. Caused loss of stop codon. Forward primer. Codons are synonymous for ala and are not rare. Will not cause problems. Reverse primer. Loss of HindIII site. My Code for Domain Species PCR product(s) made? Sequence error(s) in PCR product? Nature of error(s) # of expression vectors cloned Types of expression vector cloned (by affinity-tag) N-term H6 C-term H6 Nonfusion 3/9 0 0/9 Bsub1 Bacillus subtilis Yes Hinf1 Haemophilus influenzae Saccharomyces cerevisiae Caenorhabditis elegans Yes C type is fine. N type has error in primer annealing region. Yes Yes Yes Yes Mus musculus Homo sapiens No Yes C type has error in primer annealing region. N type is fine. 
not applicable None Scer2 Cele1 Mmus1 Hsap2 Table 3 0 0 T7 T7 lac T5 lac2 0 0/9 0 0 0 5/9 T7 T7 lac T5 lac2 0 T7 T7 lac 0 T7 0 T7 T7 lac T5 lac2 T5 lac2 0 T7 T7 lac T5 lac2 0/9 8/9 0 structure type alpha beta random turn Table 4 Per cent secondary structure 25 degrees 10 degrees Lu data Celsius Celsius 15.8 11.7 14 29.2 32.6 39 48.5 52.1 44 6.4 3.4 3 ID Swiss Prot ID or GenBank ID ECol1 Ecol2 Ecol3 HInf1 Hinf2 SSp1 SSp2 SCer1 SCer2 AVar1 Avar2 ASp1 P36539 P77667 P37026 P45344 P44672 P72731 P74596 Q07821 Q12425/Z12425 P46051 P46052 P18501 PBor1 MTub1 SPom1 SPom2 RSp.1 PPur1 BJap1 ABra1 AVin1 Avin2 P46053 Q10393 P78859 2950483 Q53211 P51217 P37029 Q43895 Q44540 2271523 Plectonema boryanum (P) Mycobacterium tuberculosis (P) Schizosaccharomyces pombe [strain 972h- for Spom2] (E) Rhizobium Sp. (P) NGR234 Porphyra purpurea (P) Bradyrhizobium japonicum (P) Azospirillum brasilense (P) Azobacter vinelandii (P) 121 118 190 205 106 114 118 125 107 107 FAln1 RCap1 RSph1 BAph1 Q47887 Q07184 Q01195 2738590 Frankia alni (P) Rhodobacter capsulatus B10 (P) Rhodobacter sphaeroides (P) Buchnera aphidicola (P) 135 106 106 113 AAeo1 BSub1 CPCC1 Ovol1 Cele1 2984147 2635713 2183309 AA68340 C41876 116 120 119 EST EST Rat1 Rat2 Mmus1 Mmus2 Hsap1 Hsap2 Unigene Rn.3442 AI059493 Unigene Mm.7884 3447426 Unigene Hs.10473 Unigene Hs.63913 Aquifex aeolius (P) Bacillus subtilis 168 (P) Cyanothece PCC8801 (P) Onchocerca volvulus (E) Caenorhabditis elegans (E) CB1489 him-8(e1489) Rattus norvegicus (E) Unknown Unknown May be required for Nfixation May be required for Nfixation Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Role in Fe-S cluster assembly (?) [UI 98250785] Unknown Unknown Unknown May be involved in septum formation during cell division [UI 98087557] Unknown Unknown FifENXW operon (?) Unknown Unknown EST EST EST EST EST EST Unknown Unknown Unknown Unknown Unknown Unknown Table 5 Organism Protein Length (a.a.) 
Escherichia coli (P) 107 122 114 Haemophilus influenza RD/KW20 [strain 114 for Hinf1] (P) 107 Synechocystis Sp. (P) 118 113 Saccharomyces cerevisiae (E) 250 185 Anabaena variabilis (P) 123 123 Anabaena Sp. (P) 113 Mus musculus (E) C57BL/6 [strain for Mmus1] Homo sapiens (E) Functional Implications Unknown Unknown Unknown Unknown Unknown Unknown Solution pH Precipitate Med-high Med-high Precipitate when in Buffer + 0.1 M NaCl High Med-high 7.0 7.0 Med-high Med Med Med 7.5 Light Light-med 10 mM NaOAc 5.0 10 mM NaPi 6.0 10 mM NaPi 10 mM Bis Tris 10 mM Tris Table 6 promoter T7 T7 lac Table 7 H6 tag location N-term C-term 0/5.3 : 37C, DE3 Inc/12: 37C, DE3 Inc/20 : 37C, pLysS Inc/16: 37C, pLysS 1/24 : 27C, DE3 16/20: 27C, DE3 2/32: 27C, pLysS 16/>24: 27C, pLysS Not tested Inc/12: 37C, pLysS 0/24: 27C, pLysS

Chapter 2: A simulation of structural similarity detection as a method for prediction of orphan gene product biochemical and cellular function

Abbreviations
PDB, Protein Data Bank; DALI, Distance Matrix Alignment; SCOP, Structural Classification of Proteins; NCBI, National Center for Biotechnology Information; CDS, coding sequence; PIR, Protein Identification Resource Database; NR, nonredundant.

Abstract
Many of the proteins corresponding to genes currently identified from large-scale nucleic acid sequencing projects show no significant sequence similarity to functionally well-characterized proteins, so no homology can be detected; hence, their genes fall under the category “orphan genes.” Scientists have developed alternative methods for predicting biochemical and/or cellular functions of these proteins. Results from these predictions can greatly assist in future functional studies of the proteins.
We have used the available collection of protein structural data and popular sequence and structure comparison programs to conduct a simulated experiment showing that structural comparison against functionally characterized protein structures can be used as a tool for predicting the biochemical and/or cellular functions of protein domains of unknown function when sequence similarity fails to present any statistically significant homologues. We tested 10 randomly selected domains of different folds and found that, for eight of the domains, the major candidate functions suggested by structural hits of poor sequence similarity provided clues that led back to the function(s) of those domains.

Introduction
The goals set forth by the Human Genome Project have led scientists to develop new ways and refine existing ways of detecting the functional implications of genes and the proteins for which they code. Many “genomicists” are making direct use of the sequence data obtained from the Project to better their understanding of the roles of the proteins coded by the recently identified genes. In these cases the sequence information alone is enough to reveal a sequence homology to functionally characterized proteins, enabling scientists to interpret the cellular and/or biochemical function of their studied protein. A strong sequence similarity or identity with another protein is taken as a sign that the two proteins are homologous. When a homology exists between two proteins it is assumed that they have evolved from a common ancestor and share function. However, many proteins only show a possible homology with proteins of known function in the “twilight zone” of homology, a range in which there is no reliable way of determining homology. In addition, a large proportion of the sequenced genes shows no similarity or identity, and thus no evident homology, with any proteins of known function.
Both of these groups make up the so-called “orphan genes” (Holm & Sander, 1996), which account for between 1/3 and 1/2 of newly sequenced genes (Botstein et al., 1997; Casari et al., 1996). The question that arises is which alternative approach to use in functional prediction of genes. One possible approach is structural similarity detection, a method that has not been thoroughly investigated in the literature. This approach involves solving the structure of the scrutinized protein, searching the protein structure database (e.g. PDB) using an algorithm (e.g. DALI) designed to detect significant structural similarity, and investigating the biochemical and cellular functions of the scrutinized protein’s homologues. The progress of NMR and X-ray crystallography is adding structures at a rate of well over a thousand structures per year to the thousands already housed in the PDB (found in statistics page of http://pdb.pdb.bnl.gov). This increase in data, plus the recent efforts to improve the structural and functional databases that manage this data, is increasing the power of this approach (Holm & Sander, 1996). Recent work has shown that structural similarity detection has been much more powerful than sequence similarity detection for isolating homologous proteins or protein substructures during examinations of databases and protein superfamilies (Levitt & Gerstein, 1998; Holm & Sander, 1997; Lima et al., 1997). This is true because the enormous number of physically possible amino acid combinations that can occur over the length of a typical domain filters down to a relatively small number of natural protein folds, estimated to be in the range of 1000-1500 (Chothia, 1992). This provides impetus for us to understand nature with a "structure-based" approach to functional genomics in the following experiment.
We simulated a situation in which the structures of 10 domains had just been solved (in reality they were already solved and functionally characterized) and no homologues were known even by lenient measures of sequence similarity. After comparing the structures of the 10 domains to those in the PDB, and after inspecting the reported functions of the nonredundant, significant structural hits of insignificant sequence similarity, we found that the major candidate functions suggested by the inspected hits offered valuable insight into the biochemical/cellular functions of eight of the 10 target domains of "unknown" function.

Experimental Procedures

Target Selection

Ten protein structural domains were chosen from version 1.37 of the SCOP database (Murzin et al., 1995; Brenner et al., 1996; Hubbard et al., 1997; found at http://pdb.pdb.bnl.gov/scop/). Domains were randomly chosen from the following four classes:
(a) all alpha
(b) all beta
(c) alpha and beta (a/b), mainly parallel beta sheets
(d) alpha and beta (a+b), mainly antiparallel beta sheets
Three domains each were obtained from classes (a) and (c), and two domains each from classes (b) and (d).

Structure Search & Comparison

Each of the 10 domains obtained from the SCOP database corresponds to a structure housed in the PDB (Abola et al., 1987; Bernstein et al., 1977). The PDB entry for only one of the targets (1gnd) contains more residues than the domain, because the domain is non-contiguously composed. Files containing the three-dimensional coordinates for each of the domains were obtained from the February 9, 1998 version of the PDB. The coordinates of each of the 10 target PDB entries were submitted interactively to the DALI server (Holm & Sander, 1993; found at http://www2.ebi.ac.uk/dali/), and a file reporting PDB-housed protein chains structurally similar to the submitted query was returned.
DALI uses an algorithm that compares distance matrix representations of the Cα positions of one peptide chain against those of other protein chains. The output file contains structural neighbors ranked by z score, a length-dependent measure of the statistical significance of a protein chain's similarity score to the submitted protein chain. This provides a universal quantitative measure of the strength of the comparison against a specific population. All of the structural neighbors provided by DALI have z scores greater than or equal to 2.0, meaning that they all show statistically significant structural similarity to the target protein chain (Holm & Sander, 1994).

Sequence Similarity Searching

To judge how the structural neighbors were related in amino acid sequence to the target peptide, version 3.06 of FASTA was used (Pearson & Lipman, 1988; Lipman & Pearson, 1985). FASTA is a program that searches for local sequence similarity and allows for the introduction of gaps during sequence alignment (Pearson, 1996). It ranks the sequence neighbors by expected (E) score, a statistical estimate of how likely it is to obtain a particular sequence similarity score by chance when searching a sequence against a particular population. A local copy of this program was run, in which the sequence of each of the 10 protein targets was searched against a local copy of the nonredundant (NR) protein sequence database of NCBI (found at the BLAST ftp site ftp://ncbi.nlm.nih.gov/blast). The database contains all of the protein sequences, identification codes, and header information from GenBank CDS translations, the PDB, SWISS-PROT, and PIR. For each of the 10 FASTA searches, the following parameters were used in searching the 309,341 sequences of the NR database: ktup = 2, gap-pen = 12/-2, and the BLOSUM 62 scoring matrix from the NCBI BLAST ftp site.
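As an illustration of how FASTA E scores can be turned into the sequence similarity categories used in this study, the following minimal Python sketch applies the cutoffs stated in the Figure 1 legend (highly probable up to 0.02, twilight zone up to 10, not probable above 10). The function names, the category abbreviations as return values, and the treatment of the boundaries as inclusive are assumptions made for this sketch; it is not the procedure that was actually scripted for the thesis.

```python
from collections import Counter

# Cutoffs taken from the Figure 1 legend. Treating the boundaries as
# inclusive (E <= 0.02 and E <= 10) is an assumption of this sketch.
HIGHLY_PROBABLE_MAX_E = 0.02
TWILIGHT_ZONE_MAX_E = 10.0


def categorize_e_score(e_score):
    """Return 'HP', 'TZ', or 'NP' for a single FASTA E score."""
    if e_score <= HIGHLY_PROBABLE_MAX_E:
        return "HP"  # highly probable detection by sequence
    if e_score <= TWILIGHT_ZONE_MAX_E:
        return "TZ"  # twilight zone
    return "NP"      # not probable detection by sequence


def tally_categories(e_scores):
    """Count how many hits fall into each category."""
    return Counter(categorize_e_score(e) for e in e_scores)


# Hypothetical E scores for the structural neighbors of one target.
example_e_scores = [1e-30, 0.5, 12.0, 150.0]
example_tally = tally_categories(example_e_scores)
```

Only hits with DALI z scores of at least 2.0 would be passed to such a function, since those are the structural neighbors the server reports.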
To correlate the sequence comparison results with the structural neighbors, it was necessary to search for the PDB id codes of each of the structural neighbors in the NR database file, and subsequently use the gi numbers assigned to the PDB entries to probe the FASTA search results.

Elimination of Redundancies

Because the PDB is highly redundant in the structures that it houses (Holm & Sander, 1998), it was necessary, in each of the DALI searches, to represent structural neighbors that were the same molecule with just one entry. The NR database, which groups protein entries of identical amino acid sequence from different databases and within the same database, was used to discern which of the structural neighbors were 100% redundant with each other in each of the DALI searches. In choosing which PDB entry should represent its set of redundant structural neighbors, two criteria were first implemented:
a) For a set of redundant PDB entries, we chose to represent the set with the entry of the lowest z score. This ensures that the selection of structural neighbors, when necessary, is done in a conservative manner.
b) Because the target molecule was included in the list of structural neighbors, it sometimes appeared in sets of redundant molecules. When this occurred, the target molecule was chosen to represent the set.
Using these criteria resulted in a 45% reduction of total structural neighbors. The reduction was most prominent among the structural neighbors of trypsin (1sgt) and dihydrofolate reductase (1dhf-a), because these proteins are heavily investigated due to their roles in therapeutic approaches to disease. However, the preservation of accurate structural representation was a concern, because there was some variation in structural similarity within sets of redundant entries.
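The two selection criteria can be sketched in a few lines of Python. This is a hedged illustration only: the input layout (a list of PDB id, sequence group, z score tuples), the function name, and the example identifiers are all hypothetical, and the grouping by identical sequence is assumed to have been done beforehand (in the study it came from the NR database). The sketch also reports the spread of z scores within each set (highest minus lowest) as a simple view of the within-set variation mentioned above.

```python
from collections import defaultdict


def pick_representatives(neighbors, target_id):
    """Choose one representative per set of identical-sequence neighbors.

    neighbors: iterable of (pdb_id, sequence_group, z_score) tuples.
    target_id: PDB id of the target, which represents any set it belongs to.
    """
    groups = defaultdict(list)
    for pdb_id, sequence_group, z_score in neighbors:
        groups[sequence_group].append((pdb_id, z_score))

    chosen = {}
    for sequence_group, members in groups.items():
        ids = [pdb_id for pdb_id, _ in members]
        z_scores = [z for _, z in members]
        if target_id in ids:
            representative = target_id  # criterion (b)
        else:
            # criterion (a): the conservative choice, lowest z score
            representative = min(members, key=lambda m: m[1])[0]
        chosen[sequence_group] = {
            "representative": representative,
            "z_range": max(z_scores) - min(z_scores),
        }
    return chosen


# Hypothetical structural neighbors; "9tgt" stands in for the target entry.
example_neighbors = [
    ("1abc", "seqA", 12.0),
    ("2abc", "seqA", 10.5),
    ("1xyz", "seqB", 30.0),
    ("9tgt", "seqB", 45.0),
    ("3def", "seqC", 4.0),
]
example_choice = pick_representatives(example_neighbors, target_id="9tgt")
```

A set with an unusually large z_range would be a candidate for the kind of manual inspection that a purely automatic rule cannot replace.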
For the 49 sets of redundant structural neighbors, the range (the highest z score in a set minus the lowest z score in that same set) showed the following properties: median of 1.1, mean of 2.71, and standard deviation of 4.43. There was not a very large difference in structures within the redundant sets except for a few extremes. In these extremes, one structure was very different from the rest of the structures within the set due to different ligands associated with the molecule, different conformations for different subunits of the protein, etc. To allow for accurate structural representation of the redundant sets, these extreme entries were excluded from the sets of redundancies and included in the pool of nonredundant structural neighbors. This resulted in a total of 396 nonredundant structural neighbors, a 44% reduction from the total structural neighbors. The range values for the redundant sets then showed the following properties: median of 0.8, mean of 1.6, and standard deviation of 1.60. In addition, because very few redundancies were found in the set of structural neighbors that showed poor sequence similarity (E score > 10) to their targets, these PDB entries were manually inspected for evidence of being the same protein from the same organism. No redundancies were found this way.

Functional Investigation

Biochemical and cellular function information was obtained for each of the nonredundant PDB chains that showed poor or no sequence similarity (E score > 10) with its target, primarily by searching the functional activity areas of the SWISS-PROT (Bairoch & Apweiler, 1998; Apweiler et al., 1997) entry for the chain’s encompassing protein. “SWISS PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein …)” (from http://expasy.hcuge.ch/sprot).
When SWISS-PROT did not provide enough functional information, additional information was retrieved from the literature and the header of the PDB entry.

Results and Discussion

Targets

The simulated experiment was conducted as though the structures of 10 domains, designated by SCOP and found in the following PDB entries, had just been solved. Each item gives the PDB id, entry name, and source organism.
(a) 2aak (formerly 1aak), ubiquitin conjugation enzyme from Arabidopsis thaliana (Cook et al., 1992).
(b) 1cks (chain A), cyclin-dependent kinase subunit type 2 (CksHs 2) from Homo sapiens (Parge et al., 1993).
(c) 3sdh (chain A), hemoglobin I (homodimer, unliganded and carbon monoxide liganded states) from Scapharca inaequivalvis (Royer, 1994; Royer et al., 1990, 1989).
(d) 1vih, vigilin, repeat 6 from Homo sapiens (Musco et al., 1996; Castiglione Morelli et al., 1995).
(e) 1dhf (chain A), dihydrofolate reductase (DHFR) complex with folate from Homo sapiens (Davies et al., 1990; Prendergast et al., 1988).
(f) 1hoe, alpha-amylase inhibitor Hoe-467A from Streptomyces tendae 4158 (Pflugrath et al., 1986).
(g) 2wrp (chain R), trp repressor (orthorhombic form) from Escherichia coli (Lawson et al., 1988; Zhang et al., 1987; Schevitz et al., 1985; Joachimiak et al., 1983, 1983; Gunsalus & Yanofsky, 1980).
(h) 1sgt, trypsin from Streptomyces griseus, strain K1 (Read & James, 1988).
(i) 1fps, farnesyl diphosphate synthase from Gallus gallus (Tarshis et al., 1994).
(j) 1gnd, guanine nucleotide dissociation inhibitor (alpha isoform) from Bos taurus (Schalk et al., 1996, 1994).

Structure vs. Sequence

The number of structurally similar, nonredundant PDB entries that would not be found by a sequence similarity search using FASTA with the parameters indicated is almost 3 times greater than the number that would have been found (see Figures 1 & 2).
Eight of the 10 searches resulted in nonredundant structures that were similar to the target domain but had very poor sequence similarity (E > 10). In fact, 284 of the 287 PDB entries that make up the “Not Probable” category had an E score over 100, which is well into the region populated by non-homologous peptides. Among the random sample of domains chosen were ones from trypsin (1sgt) and dihydrofolate reductase (1dhf-a), both the subject of many structural studies due to their biological and potential therapeutic importance. The proportion of hits with poor sequence similarity, nearly three quarters of the total, would certainly be higher if the hits of 1sgt and 1dhf-a, many of which are orthologues (same protein, different organism) of the 2 targets, had come from less studied structures. Structural similarity seems to be conserved as sequence similarity diminishes (see Figure 3). While it is true that many of the more significant structural matches, including the targets matched upon themselves, are found in the highly probable (HP) region of the graph, almost three quarters of the structurally significant hits are clustered in the poor sequence similarity (NP) region of the graph. 275 of the 287 hits in the NP region are crowded in the area with log (E) = 2 and 2 ≤ z ≤ 10.

Functional Analysis

Structural hits of poor sequence similarity can provide a wealth of biochemical and cellular information about the target structure’s function(s). Without knowledge of the hits that would probably be detected by traditional sequence similarity search methods, including those in the “twilight zone”, a list of candidate functions can be developed to gain insight into the biochemical and cellular function(s) of the target. This was possible in the following 8 of the 10 cases (see Table 1).

1sgt

Trypsin, a well-studied serine protease, yielded only 8 structural hits of poor sequence similarity.
The probable function for this target domain was easy to spot because most of its structural hits had the same function. Also, the precursor origin of this enzyme (it is derived from trypsinogen) is suggested by half of the hits being zymogens. This is a clear example of the detection of homologues that are beyond the grasp of sequence similarity detection but not of structural similarity detection.

1dhf-a

Dihydrofolate reductase (DHFR), another well-studied enzyme, catalyzes the two-step reduction of folate to dihydrofolate and ultimately tetrahydrofolate with the assistance of NADPH (Garret & Grisham, 1995). DHFR’s binding to specific portions of NADPH is suggested by the candidate functions. That 85% of its hits bind purine-based nucleotides, with an emphasis on adenine-containing nucleotides, suggests that the target might bind a similar molecule. DHFR does bind adenosine, a component of NADPH (Zheng et al., 1993; Basran et al., 1997). In addition, this candidate function agrees with the finding that DHFR’s interaction with the pyrophosphate moiety of NADPH provides most of the energetically favorable binding interactions in the NADPH-DHFR complex (Bystroff et al., 1990). That 50% of the hits have phosphorylation or kinase functions suggests that the target may interact with phosphate group(s). In addition to its pyrophosphate-binding function, DHFR has a specific basic residue that binds to the charged oxygens of the ribose 2’-phosphate group of NADPH and plays an important role in binding the coenzyme (Gargaro et al., 1996; Bystroff & Kraut, 1991). Studies have also shown that this phosphate binding site is why DHFR interacts quantitatively much better with NADPH than with NADH, and this may explain why NADPH is the preferred coenzyme for DHFR (Huang et al., 1990; Rancourt & Walker, 1990). Thus, in this case the 2 major candidate functions could lead to speculation of the correct coenzyme.
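The percentages quoted for 1dhf-a follow directly from the counts in Table 1 (14 poor sequence similarity hits). As a worked illustration, the Python sketch below converts those counts to fractions and keeps the candidate functions at or above a chosen frequency. The 0.30 threshold is an assumption that mirrors the rough 30-40% level mentioned in the Conclusion, and the function and variable names are hypothetical.

```python
# Counts for 1dhf-a as listed in Table 1; each hit is counted once for
# every candidate function it carries, so the counts need not sum to 14.
TOTAL_HITS_1DHF = 14
function_counts_1dhf = {
    "purine-based nucleotide-binding": 12,
    "NADP+ or NADPH binding": 2,
    "GTP-binding": 2,
    "ATP-binding": 7,
    "phosphorylation or kinases": 7,
    "oxidoreductases": 3,
}


def frequent_functions(counts, total_hits, threshold=0.30):
    """Return {function: fraction of hits} for functions at or above threshold."""
    fractions = {name: n / total_hits for name, n in counts.items()}
    return {name: f for name, f in fractions.items() if f >= threshold}


major_1dhf = frequent_functions(function_counts_1dhf, TOTAL_HITS_1DHF)

# Print the major candidate functions, most frequent first.
for name, fraction in sorted(major_1dhf.items(), key=lambda item: -item[1]):
    print(f"{name}: {fraction:.1%}")
```

With these counts, purine-based nucleotide-binding (12 of 14 hits) and the two functions found in 7 of 14 hits pass the threshold, which corresponds to the major candidate functions discussed for this target.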
3sdh

Hemoglobin is a protein involved in the transport of oxygen. Its oxygen-binding capacity is revealed by almost a third of its hits being globins, which show poor sequence similarity to the target domain yet still possess the necessary scaffolding to accommodate the heme group, the necessary component for oxygen transport. It is possible that light-harvesting proteins (“LHP”s) turned up as a candidate function, albeit a very minor one, because of the ligands that they bind. One of the three binds chlorophyll a, whose ring system is similar in shape to the porphyrin ring system of heme, and whose magnesium ion has the same charge as the iron contained in heme. Perhaps a homology is revealed that shows conservation for binding to these ligands.

1vih

Vigilin consists of 14 repeats of the KH module, a domain known for its RNA-binding properties (Musco et al., 1996). It was recently confirmed that vigilin binds to vitellogenin mRNA (Dodson & Shapiro, 1998). Upon inspection of the candidate functions, it is quite possible to use the major candidate function of nucleic acid-binding to guess that this target domain binds a similar molecule.

1hoe

Alpha-amylase inhibitor prevents alpha-amylase from catalyzing the hydrolysis of glycosidic linkages in starch and similar polysaccharides by occupying the receiving region of the enzyme, imitating substrate interactions with the enzyme at specific residues, and causing structural changes in the active site (Bompard-Giles et al., 1996).

2wrp-r

Tryptophan (trp) repressor binds to specific DNA sites and is involved in gene regulation. While the specific sites that this molecule binds to are not suggested by the major candidate function, its DNA-binding nature is. Perhaps, upon examination of the binding features of the hits that led to the development of the transcription regulation candidate function, the characteristics of the DNA to which it would bind can be determined.
1fps

Farnesyl diphosphate synthase (FPS) catalyzes a hydrocarbon chain elongation reaction using isopentenyl diphosphate and dimethylallyl diphosphate (Cunillera et al., 1997; Tarshis et al., 1996). Hydrocarbon-binding is suggested by the major candidate function, and its magnesium-binding is suggested by the next most frequent candidate function.

1gnd

Guanine nucleotide dissociation inhibitor (alpha isoform, GDI) binds to Rab proteins and prevents GDP from dissociating from these GTPases (Wu et al., 1996; Schalk et al., 1996). It is accepted that GDI shares high structural similarity with various flavoproteins (Wu et al., 1996), yet it does not show binding activity with flavin-based molecules in the expected regions suggested by motifs and structure (Schalk et al., 1996).

Conclusion

We found that 63% of the structural hits of statistically insignificant sequence similarity had reported functions that were used to derive the candidate functions that proved helpful in determining the biochemical/cellular functions of the target domains. We also found that, upon examination of the candidate functions, the ones that appeared with high frequency (30-40% or greater) within each of the individual searches proved to be helpful. While it is not likely that all of the useful hit structures are homologues of their target domains, the robustness of the ability of structure to predict function is consistent with a recent study showing that structural comparison had a superfamily (same fold, similar function) homology detection rate twice as good as sequence comparison when both were applied to the same population of homologous proteins and leveled for the same error rate (Levitt & Gerstein, 1998). Our work ties in very well with the recent emphasis placed by structural biologists on solving the structures of orphan gene products for the discovery of new protein folds (Pennisi, 1998).
The theory that non-radical degeneracy in amino acids is tolerated by evolution when creating similar structures and thus similar function (Holm & Sander, 1996), with which our work agrees, predicts that a proportion of the structures of orphan gene products can be successfully compared with existing structures that have been functionally characterized. The functional insight gained by this approach can be used to select the small fraction of biological assays needed to measure the function(s) of these proteins. In addition, with the increase in the novelty, quantity, and diversity of solved protein structures, this "structure-based approach to functional genomics" will grow in strength and value.

Acknowledgement

We thank M. Mehnert for designing PERL scripts that were used to check much of the compiled data.

References

Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F., and Weng, J. (1987) Protein Data Bank, in Crystallographic Databases - Information Content, Software Systems, Scientific Applications (Allen, F.H., Bergerhoff, G., and Sievers, R., eds.) pp. 107-132, Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester.
Apweiler, R., Gateau, A., Contrino, S., Martin, M.J., Junker, V., O'Donovan, C., Lang, F., Mitaritonna, N., Kappus, S., and Bairoch, A. (1997) in ISMB-97 Proceedings 5th International Conference on Intelligent Systems for Molecular Biology, pp. 33-43, AAAI Press, Menlo Park.
Bairoch, A. and Apweiler, R. (1998) NAR 26, 38-42.
Basran, J., Casarotto, M.G., Basran, A., and Roberts, G.C. (1997) Protein Engineering 10, 815-826.
Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer Jr., E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977) J. Mol. Biol. 112, 535-542.
Bompard-Giles, C., Rousseau, P., Rouge, P., and Payan, F. (1996) Structure 4, 1441-1452.
Botstein, D., Chervitz, S.A., and Cherry, J.M. (1997) Science 277, 1259-1260.
Brenner, S., Chothia, C., Hubbard, T.J.P., and Murzin, A.G. (1996) Meth. Enzymol. 266, 635-642.
Bystroff, C., Oatley, S.J., and Kraut, J. (1990) Biochemistry 29, 3263-3277.
Bystroff, C. and Kraut, J. (1991) Biochemistry 30, 2227-2239.
Casari, G., De Daruvar, A., Sander, C., and Schneider, R. (1996) Trends Genet. 12, 244-245.
Castiglione Morelli, M.A., Stier, G., Gibson, T.J., Joseph, C., Musco, G., Pastore, A., and Trave, G. (1995) FEBS Lett. 358, 193.
Chothia, C. (1992) Nature 357, 543-544.
Cook, W.J., Jeffrey, L.C., Sullivan, M.L., and Vierstra, R.D. (1992) J. Biol. Chem. 267, 15116.
Cunillera, N., Boronat, A., and Ferrer, A. (1997) J. Biol. Chem. 272, 15381-15388.
Davies, J.F. 2nd, Delcamp, T.J., Prendergast, N.J., Ashford, V.A., Freisheim, J.H., and Kraut, J. (1990) Biochemistry 29, 9467-9479.
Dodson, R.E. and Shapiro, D.J. (1998) Mol. Cell. Biol. 18, 3991-4003.
Gargaro, A.R., Frenkiel, T.A., Nieto, P.M., Birdsall, B., Polshakov, V.I., Morgan, W.D., and Feeney, J. (1996) European Journal of Biochemistry 238, 435-439.
Garret, R.H. and Grisham, C.M. (1995) Biochemistry p 498 (Saunders College Publishing and Harcourt Brace College Publishers, Orlando, FL).
Gunsalus, R.P. and Yanofsky, C. (1980) Proc. Natl. Acad. Sci. USA 77, 1980.
Holm, L. and Sander, C. (1993) J. Mol. Biol. 233, 123-138.
Holm, L. and Sander, C. (1994) NAR 22, 3600-3609.
Holm, L. and Sander, C. (1996) Science 273, 595-602.
Holm, L. and Sander, C. (1997) NAR 25, 231-234.
Holm, L. and Sander, C. (1998) Bioinformatics 14, 423-429.
Huang, S., Appleman, R., Tan, X.H., Thompson, P.D., Blakley, R.L., Sheridan, R.P., Venkataraghavan, R., and Freisheim, J.H. (1990) Biochemistry 29, 8063-8069.
Hubbard, T.J.P., Murzin, A.G., Brenner, S.E., and Chothia, C. (1997) NAR 25, 236-239.
Joachimiak, A., Schevitz, R.W., Kelley, R.L., Yanofsky, C., and Sigler, P.B. (1983) J. Biol. Chem. 258, 12641.
Joachimiak, A., Kelley, R.L., Gunsalus, R.P., Yanofsky, C., and Sigler, P.B. (1983) Proc. Natl. Acad. Sci. USA 80, 683.
Lawson, C.L., Zhang, R.-G., Schevitz, R.W., Otwinowski, Z., Joachimiak, A., and Sigler, P.B. (1988) Proteins, Structure, Function, Genetics 3, 18.
Levitt, M. and Gerstein, M. (1998) Proc. Natl. Acad. Sci. USA 95, 5913-5920.
Lima, C.D., Klein, M.G., and Hendrickson, W.A. (1997) Science 278, 286-290.
Lipman, D.J. and Pearson, W.R. (1985) Science 227, 1435.
Murzin, A., Brenner, S.E., Hubbard, T., and Chothia, C. (1995) J. Mol. Biol. 247, 536-540.
Musco, G., Stier, G., Joseph, C., Castiglione Morelli, M.A., Nilges, M., Gibson, T.J., and Pastore, A. (1996) Cell 85, 237-245.
Parge, H.E., Arvai, A.S., Murtari, D.J., Reed, S.I., and Tainer, J.A. (1993) Science 262, 387.
Pearson, W.R. and Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444.
Pearson, W.R. (1996) Meth. Enzymol. 266, 227-258.
Pennisi, E. (1998) Science 279, 978-979.
Pflugrath, J.W., Wiegand, G., Huber, R., and Vertesy, L. (1986) J. Mol. Biol. 189, 383-386.
Prendergast, N.J., Delcamp, T.J., Smith, P.L., and Freisheim, J.H. (1988) Biochemistry 27, 3664-3671.
Rancourt, S.L. and Walker, V.K. (1990) Biochem. Cell. Biol. 68, 1075-1082.
Read, R.J. and James, M.N. (1988) J. Mol. Biol. 200, 523-551.
Royer, W.E.J., Hendrickson, W.A., and Chiancone, E. (1989) J. Biol. Chem. 264, 21052-21061.
Royer, W.E.J., Hendrickson, W.A., and Chiancone, E. (1990) Science 249, 518-521.
Royer, W.E.J. (1994) J. Mol. Biol. 235, 657-681.
Schalk, I.J., Stura, E.A., Matteson, J., Wilson, I.A., and Balch, W.E. (1994) J. Mol. Biol. 244, 469.
Schalk, I., Zeng, K., Wu, S.-K., Stura, E.A., Matteson, J., Huang, M., Tandon, A., Wilson, I.A., and Balch, W.E. (1996) Nature 381, 42-48.
Schevitz, R.W., Otwinowski, Z., Joachimiak, A., Lawson, C.L., and Sigler, P.B. (1985) Nature 317, 782.
Tarshis, L.C., Yan, M., Poulter, C.D., and Sacchettini, J.C. (1994) Biochemistry 33, 10871-10877.
Tarshis, L.C., Proteau, P.J., Kellogg, B.A., Sacchettini, J.C., and Poulter, C.D. (1996) Proc. Natl. Acad. Sci. USA 93, 15018-15023.
Wu, S.-K., Zeng, K., Wilson, I.A., and Balch, W.E. (1996) Trends in Biochemical Sciences, 472-476.
Zhang, R.-G., Joachimiak, A., Lawson, C.L., Schevitz, R.W., Otwinowski, Z., and Sigler, P.B. (1987) Nature 327, 891.
Zheng, J., Chen, Y.Q., and Callender, R. (1993) European Journal of Biochemistry 215, 9-16.

Table 1: Summary of the frequency of occurrence of candidate functions consistently appearing as keywords for poor sequence similarity structural hits of target domains (a)

1sgt - hydrolase, serine protease, zymogen
    8/8    hydrolases
    8/8    proteases
    7/8    serine proteases
    1/8    thiol protease
    4/8    zymogens

1dhf-a - oxidoreductase, reductase, dehydrogenase, NADPH-binding
    12/14  purine-based nucleotide-binding
    2/14   NADP+ or NADPH binding
    2/14   GTP-binding
    7/14   ATP-binding
    7/14   phosphorylation or kinases
    3/14   oxidoreductases

3sdh - oxygen transport
    6/19   oxygen transport or storage
    3/19   acetylation or methylation
    3/19   light-harvesting protein
    3/19   DNA-binding

1vih - ribonucleoprotein, RNA-binding
    14/29  nucleic acid-binding
    9/29   DNA-binding
    7/29   RNA-binding
    7/29   transferases: nucleotidyltransferases, phosphotransferases, acetyltransferase, glycosyltransferase
    5/29   transcription regulation or stimulation
    3/29   acetylation and/or methylation

1hoe - glycosidase inhibitor, alpha amylase inhibitor, disulfide linkages, mimicry of polysaccharides
    26/33  carbohydrate-binding or carbohydrate-bound proteins
    11/33  glycoproteins
    6/33   cell adhesion proteins
    6/33   immunoglobulins
    3/33   glycosidases
    3/33   anti-glycosidases, glycosyltransferase, acetyltransferase
    8/33   immunoglobulin fold proteins
    3/33   calcium-binding

2wrp-R - DNA-binding, transcription regulation, repressor, protects DNA from endonuclease activity
    27/37  DNA-binding
    17/37  transcription regulation
    5/37   repressor
    7/37   activator
    11/37  nuclear proteins

1fps - prenyltransferase, isoprene biosynthesis, isoprenoid-binding, isoprenoid elongation, magnesium-binding
    21/67  carbohydrate-binding or carbohydrate-bound proteins
    11/67  glycoproteins
    7/67   carbohydrate transferase methyltransferase, glycosyltransferase, sugar transport, pentosyltransferase, methylation
    20/67  metal-binding
    8/67   heme-binding
    7/67   iron-binding
           bind to magnesium, calcium, cobalt, zinc, vanadium
    15/67  nucleic acid-binding
    11/67  DNA-binding
    9/67   nucleotide-binding

1gnd - GTPase activation inhibitor, Rab protein-binding, guanine nucleotide dissociation inhibitor
    54/80  bind to nucleotides, nucleosides, or nucleotide composed molecules
    49/80  bind to purine-based molecules
    45/80  bind to adenine-based molecules
    34/80  bind to NAD-based molecules
    11/80  bind to FAD
    38/80  oxidoreductases
    13/80  metal-binding
    6/80   zinc-binding
    3/80   bind to pyrimidine-based molecules: thymine-binding, uracil-binding
    3/80   nucleic acid-binding

(a) PDB id codes are included for each of the target domains, and their functions follow the hyphen. It is important to note that this represents a summary of how often a certain function was reported for a PDB entry. If a protein has two candidate functions, it is counted once for each of the candidate functions among the hits of a target domain. 2aak and 1cks-a were not included due to lack of poor sequence similarity hits.

Figure Legends

Figure 1: Sequence similarity categorization for nonredundant structural similarity hits. The columns show results for searches done with each of the target domains, which are represented by their PDB id codes (including their chain designations if applicable). The sequence similarity values were determined by searching against the NCBI NR database with FASTA. The E scores obtained from FASTA were used to indicate the likelihood of finding the hit through sequence similarity search as follows: highly probable, E ≤ 0.02; twilight zone, 0.02 < E ≤ 10; not probable, E > 10. Cutoffs for categories were determined from a literature survey (Pearson, 1996), and possible homology detection via sequence similarity search is given the benefit of the doubt in the fringe area between “Not Probable” and “Twilight Zone” (personal communication with Bruccoleri, R.E.). Dot density corresponds to the probability of determining possible homology via sequence similarity.

Figure 2: All nonredundant structural hits of 10 target domains categorized by sequence similarity. The whole pie represents all of the 398 nonredundant structural hits obtained from the 10 DALI searches.

Figure 3: z score vs. log (E score) for all nonredundant structural hits. HP: highly probable detection of similarity by sequence, TZ: in the “twilight zone” of detection by sequence similarity, NP: not probable detection of similarity by sequence. Each of the symbols represents structural hits obtained from comparison of the PDB entry indicated in the legend box. The numbers next to the PDB id codes in the legend box are, in order, the number of points for each search, the z score received for the target perfectly matched against itself, and the length of the target. The PDB entries that were not detected during FASTA searches (E score > 100) were assigned an E score of 100. PDB entries that received an E score smaller than 10^-25 were assigned an E score of 10^-25.

[Figure 2: pie chart with sectors Highly Probable 23%, Twilight Zone 5%, Not Probable 72%]

The End