Structural genomics of conserved gene families

By Ram Mani
Advisor: Gaetano T. Montelione, Ph.D.
Center for Advanced Biotechnology and Medicine;
Department of Molecular Biology & Biochemistry,
Rutgers, The State University of New Jersey - Piscataway, NJ
In fulfillment of requirements for the
Henry Rutgers Scholars Program
May 2000
Acknowledgements
I have been working in Dr. Gaetano Montelione's laboratory since May 1997, which
makes the number of persons to thank very large. If someone is excluded please forgive me.
Members of Dr. Montelione's lab have been tremendously supportive. Rong Xiao has
taught me so much during my protein expression and purification work. Daphne Palacios has
been an excellent teacher during much of the molecular biology work of my COG project. Dr.
Kristin Gunsalus provided essential coordination during my work with COG's, and her protein
expression kit has been a valuable resource. Several others (e.g. Drs. Albert Chien and Parag
Sahasrabudhe) were always willing to answer my questions about DNA and protein preparation.
Without the work of Alexandra Gardino and Charles Lu, my work on COG 0229S would
not have been possible. Emily Ly was kind enough to share in my molecular biology work in
COG 0316S. Gurmukh Sahota has been very gracious in helping me in times of need.
Dr. Montelione and Dr. Stephen Anderson have been kind enough to serve on my thesis
presentation committee and deal with delays that came with the composition of this thesis. Both
have provided valuable assistance during my computational work.
Last, but not least, I would like to recognize the overall direction that Dr. Montelione has
provided me during all of my projects. His intelligence and creativity have been large motivating
factors for my work.
Table of Contents
Chapter 1
Introduction
Experimental Methods
Computational Biology
Molecular Biology
Biochemistry
Results and Discussion
Conclusion and Future Work
References
Legends
Figures and Tables
Chapter 2
Abstract
Introduction
Experimental Methods
Results and Discussion
References
Table
Figure Legends
Figures
Preface to the thesis
This thesis was written as two chapters, each written independently of the other. Chapter
one describes work that I have done during investigation of three domain families of unknown
function -- domains 0011, 0229, and 0316. Many of the gene products that populate these
families are from predicted open reading frames from genome sequencing projects. The basis
for these domain families was a gene product classification scheme called "clusters of
orthologous groups of proteins (COG's)" (Tatusov, R.L., et al., 1997). What makes some of
these domain families special is that they are conserved among several organisms of different
phyla. For those families that are represented in both ancient and recent lineages,
Darwinian theory allows us to assume that they may be fundamental to life. Computational
biology, molecular biology, and biochemistry techniques are being used to intelligently choose
targets within these domain families and ultimately discover the biochemical functions of
domains 0229 and 0316. We believe that by understanding the structure of each of these
domains, we stand a very good chance of understanding their function.
The study described in chapter two partially justifies our structure-based approach to
functional-genomics. The study probes the question: if a structure was obtained of a protein of
unknown function and having no homologues detected via sequence comparison, would
structural comparison to the Protein Data Bank structures result in useful, functional insight? By
simulating this scenario, we see that structural comparison provides useful clues when sequence
comparison fails for eight out of ten randomly chosen cases. This result gives more reason to
allow the work of chapter one to be one of the foundations for a structural genomics approach of
significant potential.
Chapter 1: Structural genomics of conserved gene families.
Introduction
At a March 1999 press conference, Vice President Al Gore said:
I am extremely pleased that the Human Genome Project (HGP) has accelerated
efforts to complete one of the most important scientific projects in human history
-- unlocking the secrets of the genetic code. The Project will forever change how
we understand the human body and disease, leading to improved prevention,
treatments, and cures for what are currently medical mysteries
(http://www.ornl.gov/hgmis/project/update.html).
Gore revealed a fundamental point in his statement: the tremendous scientific and medical
benefits that we will gain from the genome sequencing projects of humans and other organisms
rest upon how we choose to understand the data obtained from these projects (Tilghman, S.M.,
1996). To this end, our lab has chosen to contribute to the interpretation of this data by using
nuclear magnetic resonance (NMR) spectroscopy to determine the protein structures coded by
genes shared by different organisms. Target selection and structure determination applied across
one or more genomes represents a new area of science called structural genomics (proteomics).
Structural genomics (Montelione, G.T. & Anderson, 1999) is the field concerned with
determining the three-dimensional structures of proteins coded by genomes and using this
structural insight to understand the biochemical and cellular functions of these proteins, i.e.
functional genomics (Feng, W., et al., 1998; Hwang, K.Y., et al., 1999). However, with this
"genome approach" to structure determination, one can easily be overwhelmed by the
tremendous number of potential protein targets. The human genome alone consists of eighty
thousand to one hundred thousand genes, and the process of determining a protein structure by
NMR or X-ray crystallography traditionally takes months to years. We have devised a set of
criteria that may result in the discoveries of protein structures of significant impact on the
structural biology community. Specifically, we consider: gene products that show significant
sequence similarity (homology) to gene products of different organisms, with the organisms
being evolutionarily diverse and some being Metazoan (e.g. H. sapiens, M. musculus, C. elegans);
gene products that are of suitable size and predicted solubility for NMR spectroscopy; and gene
products that have no known biochemical/cellular function. The first criterion implies that the
gene may be fundamental for life since it is found in a majority of the different kingdoms of life,
no matter how primitive or sophisticated. Another benefit of the first criterion is that investigating
a "family of homologues" can be done in a parallel manner; thus, if one protein shows poor
biophysical properties, work on the family can be continued because "cousin" proteins may
provide better results.
An excellent foundation for our investigation is clusters of orthologous groups of proteins
(COG's). This is a classification scheme (Tatusov, R.L., et al., 1997) of homologous genes from
an evolutionarily diverse set of organisms whose genomes were completely sequenced. A COG is
defined as a set of 3 or more evolutionarily diverse best hits from genome versus genome
sequence comparisons. A fundamental assumption is that each COG represents a structural
domain family. A structural domain is a piece of a protein that folds independently and has been
proposed to be a protein module whose DNA moves throughout the course of evolution (Patthy,
L., 1996). Because many domains retain stability without their parent proteins, many structural
studies examine a protein one isolated domain at a time. NMR structure studies often are
focused on single structural domains rather than multi-domain proteins because of the 30 kDa
size limitations for NMR data interpretation.
In addition to intelligent target selection, the structural genomics effort depends upon
rapid and high quality structural determination methods. X-ray crystallography and NMR
spectroscopy are two complementary methods. X-ray crystallography makes use of X-ray
diffraction upon crystallized macromolecules to determine the structures of those molecules.
Proteins in the range of several hundred kilodaltons molecular weight can be examined
through X-ray crystallography. Although a major bottleneck is the crystallization of these
macromolecules, burgeoning techniques and technology are making the rapid acquisition of
high-quality, atomic structures realistic for X-ray crystallography (Terwilliger, et al., 1998).
NMR complements the role of X-ray crystallography by allowing for the investigation of
molecules in their native, solution state. NMR also allows structural investigation for some of
those proteins that do not crystallize. Rapid acquisition of high-quality, atomic structures is
closer to reality due to a multitude of developments in technique and technology (Montelione &
Anderson, 1999). The bulk of the data for the structure will come from triple resonance
experiments. Software (SPARKY: http://www.cgl.ucsf.edu/home/sparky/) from a collaborating
group allows for rapid acquisition of peaks from NMR spectra. In addition, our lab has
developed software to automate the task of derivation and analysis of resonance assignments
(Montelione, G.T., et al., 1999). With high-resolution structures, we will be in a better situation
to test the biochemical/cellular function of domains of unknown function. Even though the
domains being investigated show no sequence similarity to proteins of known function,
similarities that the structure may show to protein structures of known function can act as a guide
in testing function. This may be possible because structure is preserved much better than
sequence throughout the course of evolution (Holm, L. & Sander, C., 1996; Levitt, M. &
Gerstein, M., 1998; Holm, L. & Sander, C., 1997; Lima, C.D., et al., 1997).
The benefits from this project can be many. We can begin to understand the
biochemical/cellular function of conserved domains. If the structures of these domains are
dissimilar to anything known -- the probability of this is enhanced because of the lack of sequence
similarity to anything of known structure -- they may serve as members of the collection of
known protein folds existing in nature. The members of this database, which is estimated to be
composed of some one thousand to two thousand unique folds, can be used as structural and
functional models for the many sequences coming from the genome projects (Chothia, C., 1992).
This will assist in providing novel templates for homology-modeling projects. In addition,
because many of these COG domains are present in so many different organisms, knowledge
gained about them will benefit evolutionary studies. The presence of these domains in model
organisms can be exploited to elucidate the functional roles of these domains through
experiments designed for those systems (e.g. RNA interference in C. elegans).
With this in mind, work on two of three domain families has supplied us with candidates
for structural determination projects. COG 0011S showed no representation in Metazoa, and
therefore work on it has been stopped after the computational biology stage. Protein purification
and biophysical studies for COG 0229S will be examined in this thesis. Experiments show this
domain to be a candidate for structural determination via NMR. Computational biology,
molecular biology, protein expression, and protein purification studies for COG 0316S will
also be examined. Experiments show that this domain may not be suitable for NMR, but may rather be a
candidate for X-ray crystallography.
Experimental Methods
Computational Biology
A list of COG's of potential suitability for NMR was provided to the lab (Eugene Koonin,
NIH). These 15 COG's contain conserved regions 49 to 382 residues in length (the majority are
less than 200) and possessed no predicted transmembrane regions. They were of the R or S
classes -- function not well understood or function unknown, respectively. The following two
COG's (0011S and 0316S) were subjected to a general computational biology protocol (figure 1).
These two COG's were among the first analyzed and expanded by the lab. They were chosen
first because of their apparently short domain lengths. A step-by-step listing of the protocol is
found at:
http://www-nmr.cabm.rutgers.edu/bioinformatics/cogs/protocol/protocol2.html
Since this work, members of this lab have developed a more effective and efficient protocol.
Acquisition and Initial Analysis of Original Sequences in COG
COG protein sequences were obtained from the NCBI site
(http://www.ncbi.nlm.nih.gov/COG/) and were analyzed for regions of conservation and
potential domain ends, i.e. the beginning or end of a conserved region. This was done by creating a
multiple sequence alignment (msa) of all the protein sequences listed under the COG. The
algorithm and program CLUSTALW (Thompson J.D., et al., 1994) of the software suite NCSA
Biology Workbench version 3.0 provided by the University of Illinois
(http://biology.ncsa.uiuc.edu/) was used to generate the msa. To aid in visualization of
conserved residues and regions within the msa, the program Boxshade 3.3.1 (Kay Hofmann and
Michael D. Baron) in the same software suite was used to color conserved regions and to
derive a consensus sequence for the msa.
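As a rough illustration of how a consensus sequence can be read off an msa, the following Python sketch takes the most frequent residue in each column and reports "x" where no residue reaches a chosen threshold. The 0.5 threshold, the function name, and the toy sequences are assumptions made for illustration only; the rules actually applied by Boxshade are not reproduced here.

```python
from collections import Counter


def consensus(aligned, threshold=0.5, gap="-"):
    """Majority-rule consensus for equal-length aligned sequences (illustrative only)."""
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    out = []
    for column in zip(*aligned):
        residues = [r for r in column if r != gap]
        if not residues:
            out.append(gap)  # column made up only of gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # The fraction is taken over all sequences, so gaps count against conservation.
        out.append(residue if count / len(column) >= threshold else "x")
    return "".join(out)


# Hypothetical toy alignment, not taken from COG 0316S.
toy_msa = ["LTLTDAA", "IRLTEAA", "LTLT-SA", "LKLTDGA"]
print(consensus(toy_msa))
```

Raising the threshold makes the consensus stricter, so fewer columns are reported as conserved.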
Gathering of Possible Structural & Functional Information of COG
The consensus sequence provided from the msa was used to search the Protein Data Bank
for structural homologues. The PSI-BLAST online program (Altschul, F., et al., 1997) was used
to search NCBI's local PDB sequence database (E score < 10, noncomplex regions remain
unfiltered). No structural homologues were found. In addition, the consensus sequence and some
of the original sequences were checked for transmembrane region features. No evidence of
transmembrane regions was found for the COG 0316S proteins. Likewise, a search was
conducted for functional implications of the COG proteins. Primary reliance was placed upon
the Swiss-Prot database (Bairoch, A., 2000 -- http://www.expasy.ch/sprot/) because this is one of
the better annotated, protein databases. When applicable, Medline abstracts and articles were
consulted to discern function of proteins.
Expansion of COG's
After the initial analysis, the next step was to see if the COG domain was represented
in organisms with genomes that were not included in the COG scheme of Tatusov and
colleagues, i.e. those genomes that are not completely-sequenced. It should be noted that this
"expansion" of the COG actually breaks down the restrictions placed on the NCBI scheme.
Tatusov and colleagues used the "best-hit among completely-sequenced genomes" criterion to
ensure that they were creating a "tightly-knit" group of proteins that were homologous to each
other. By comparing completely-sequenced genomes, one could distinguish orthologues from
paralogues. Since genomes not completely sequenced were examined in our expansion, we are
not sure if orthologues or paralogues were added to the expanded COG. This uncertainty has
implications in conservation of function. However, our primary concern was to develop a more
populated domain family. Our basis for adding homologous proteins containing a domain
common to the COG was the following set of criteria: statistical significance of similarity
between sequences (E score less than or equal to ~10^-2), good alignment between sequences for
the region assumed to constitute the domain, sequence being a neighbor rather than outlier in the
phylogenetic tree of the COG, and sequence possessing the identical or similar residues that were
well-conserved in the msa. To search for these possible additions to the COG, the program
HMMER (Sean Eddy, http://hmmer.wustl.edu/) was used to search the nonredundant (nr) protein
database of NCBI. HMMER works by creating a hidden Markov model statistical representation
of an inputted multiple sequence alignment. This profile is then compared to each of the
sequences in the queried database to look for possible matches. In addition to this search,
PRODOM (Corpet, F., et al., 1998) and Prosite (Hoffman, K., et al., 1999) were searched to see
if additional proteins may be added to the domain family. These homologues were also
inspected for function by consulting the Swiss-Prot database and necessary literature.
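Two of the acceptance criteria above (the E score cutoff and agreement with well-conserved residues) can be pictured as a simple filter. The Python sketch below is only an illustration: the record layout, the function name, the 0.8 match fraction, and the example hits are hypothetical, and it tests residue identity only, whereas the criteria above also accept similar residues. Alignment quality and position in the phylogenetic tree still require inspection.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    e_value: float
    aligned: str  # hit sequence aligned to the consensus, same length as the consensus


def passes(hit, consensus, conserved_cols, e_cutoff=1e-2, min_match=0.8):
    """Return True if a hit meets the E score cutoff and matches conserved columns."""
    if not conserved_cols:
        raise ValueError("at least one conserved column is required")
    if len(hit.aligned) != len(consensus):
        raise ValueError("hit must be aligned to the consensus")
    if hit.e_value > e_cutoff:
        return False
    # Assumption: identity only; a similarity matrix could be substituted here.
    matches = sum(hit.aligned[i] == consensus[i] for i in conserved_cols)
    return matches / len(conserved_cols) >= min_match


# Hypothetical consensus, conserved columns (0-based), and hits.
cons = "LTLTDAA"
cols = [0, 2, 3, 6]
hits = [
    Hit("candidate_a", 1e-5, "LKLTEGA"),
    Hit("candidate_b", 0.5, "LTLTDAA"),
    Hit("candidate_c", 1e-3, "MKVSEGA"),
]
accepted = [h.name for h in hits if passes(h, cons, cols)]
print(accepted)
```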
The non-redundant protein database lacks translated expressed sequence tag (EST)
sequences. Currently, the EST "shotgun" approach to sequencing is being used to sequence many
of the genomes from eukaryotes. Thus, to consider many of the homologous proteins from
higher organisms, one needs to search EST’s. Since HMMER cannot be used to search nucleic
acid databases, BLAST was used to search the NCBI database of EST's (dbest). The consensus
sequence from the most recent msa was used as the query sequence, which allows for the capture
of some of the additional conservation information of the expanded COG rather than the original
COG. A BLAST module that compares a protein sequence to a database of nucleic acid sequences
translated into the six possible gene products was used as the search program. EST's with statistically
significant similarity to the query (E score < 10^-2) were searched against the NCBI nr and EST databases to
see if overlapping nucleic acid sequences existed. This was done because EST's are often not full length or may
contain sequencing errors at relatively high frequency. In addition, the Unigene database (Schuler,
et al., 1996) of NCBI was consulted because it contains updated groups of overlapping EST's in
the public domain for mouse and human. After all possible overlapping sequences were
acquired, the assembler CAP (Huang, X., 1996) at the Baylor College of Medicine (BCM)
website was used to build a more accurate and longer cDNA sequence. This assembled "contig"
was then translated into the six possible gene products (with attention paid to stop codons) using
the BCM 6-frame translator (http://dot.imgen.bcm.tmc.edu:9331/seq-util/seq-util.html). These
resultant protein sequences were compared to the msa consensus to determine which translation
may code for a possible homologue. These translations were accepted or rejected using similar
criteria explained previously.
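The six-frame translation step can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code, not the BCM tool itself; the function names and the short example sequence are assumptions. Stop codons are written as "*" and incomplete or ambiguous codons as "X".

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; trailing partial codons are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of all three frames on both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical nine-base example, not a sequence from this work.
for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations can then be compared against the msa consensus, with frames interrupted by "*" treated with caution.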
Choosing to express domain
With more information now available about this domain, we were in a better position to
choose whether it should undergo a structural determination project. It was decided to pursue
domains that were: conserved in a large number of phylogenetically diverse organisms, found in
Metazoa, and most appropriate for NMR (small size and no predicted coiled-coils). COG 0011S
was excluded from further investigation because of its absence in Metazoa. COG 0316S fulfilled
the previously listed criteria. Its domain termini were reevaluated by looking at the more highly
populated msa to better determine which residues may be more critical for structural and/or
functional integrity. With the philosophy that we should express several of the members’
domain sequences from the COG in case some of the proteins give problems during our
investigation, six domain sequences were chosen for expression. This selection was done by
inspecting the phylogenetic tree for "subgroups" of the domain sequences. The tree is a
graphical representation of a distance matrix comparing the evolutionary differences between
pairs of sequences in the msa. The distance between sequences in the tree represents how
evolutionarily different those sequences are; the nodes of the tree indicate
divergence of sequences. Assuming that sequences within a branch are most similar to each
other, we subgrouped sequences by the major branch on which they reside. Attempts were made to
choose at least one sequence from each major branch of the tree. Selection of sequences within a
subgroup was done by availability of the DNA coding for that sequence, confidence in the
quality and length of reported cDNA coding for that sequence, and preferences of the scientific
community and/or the lab for study of the organism from which the DNA originates. Once the
sequences to express were chosen, the domain termini were finalized. Previous predictions of
domain termini and the revised msa were consulted. In addition, a secondary structure prediction
of the sequences to express was conducted using the algorithm and program PHDsec (Rost, B. &
Sander, C., 1993, 1994) using the online software suite PredictProtein (Rost, B., 1996). Domain
termini were revised so as to not interrupt predicted secondary structure elements.
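The final revision of domain termini can be pictured with the Python sketch below. It assumes a per-residue prediction string in which "H" marks helix, "E" marks strand, and any other character marks unstructured residues, and it assumes that a boundary falling inside a predicted element is moved outward to the edge of that element. The direction of adjustment, the function name, and the example string are assumptions; the text above states only that predicted elements were not interrupted.

```python
def extend_termini(ss, start, end, structured="HE"):
    """Widen 0-based inclusive domain limits so they do not cut a predicted element."""
    if not (0 <= start <= end < len(ss)):
        raise ValueError("start and end must lie within the prediction string")

    # Assumption: boundaries move outward, never inward.
    # Move the N-terminal limit upstream while the same element continues before it.
    while start > 0 and ss[start] in structured and ss[start - 1] == ss[start]:
        start -= 1
    # Move the C-terminal limit downstream while the same element continues after it.
    while end < len(ss) - 1 and ss[end] in structured and ss[end + 1] == ss[end]:
        end += 1
    return start, end


# Hypothetical prediction: helix at 2-7, strand at 11-14, helix at 17-20.
prediction = "--HHHHHH---EEEE--HHHH--"
print(extend_termini(prediction, 4, 18))
```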
Molecular Biology & Biochemistry ("Benchwork")
The next phase of the project was to express, purify, and analyze several of the domain
sequences for COG's. As explained before, it was decided to discontinue work on COG 0011S
and to continue work on COG 0316S. Thus, the molecular biology protocol (Figure 2) will refer
to steps taken in the preparation of expression vectors for the 316 domain sequences.
DNA Acquisition & Preparation
After determining which sequences to express from the 316 domain family, DNA coding for
the protein sequences was obtained. Genomic DNA (gDNA) for the B. subtilis, H. influenzae,
and S. cerevisiae sequences and cDNA for the C. elegans, M. musculus, and H. sapiens
sequences were obtained (Table 1). The gDNA had already been isolated, and it was provided to
us by neighboring labs (S. Anderson and S. Brill). cDNA was purchased from the vendor
Genome Systems or acquired from institutions (Dr. Yuji Kohara of Japan's National Institute of
Genetics). Since all cDNA was supported only by EST data, it was sequenced upon receipt.
DNA Acquisition & Preparation: 316 C. elegans (CEle1)
Isolating the cDNA for the C. elegans coding sequence required excision of the
pBluescript SK phagemid (-) from the lambda ZAP vector. Methodology for this process --
which makes use of the ExAssist helper phage -- is described in the instruction manual for the
Lambda ZAP II Library (Stratagene). Once the pBluescript vector was isolated, the C. elegans
cDNA insert contained in it was sequenced by using the T3 and T7 primers that border it.
The sequence was verified against the contig assembled from the EST's. This sequencing
demonstrated that the insert contained coding sequence whose translated product contained the
expected stop codon (i.e. it was complete in the portion coding for the C-terminus of the
complete protein). The sequencing also demonstrated that the insert contained coding sequence
upstream of that reported in the GenBank EST's. In other words, additional residues in the
N-terminal portion of the protein were revealed. However, no methionines could be found in these
residues. This leads to the conclusion that although more of the N-terminal portion of the protein
was revealed through our cDNA sequencing, the cDNA is probably too short to contain the entire
coding sequence. Recently (after the creation of the expression vectors), a predicted gene product
almost identical to the 316 CEle1 sequence was deposited in the WormPep database (Sanger
Center & Washington University C. elegans genome project database). This sequence contains a
methionine residue at the N-terminus. This sequence also contains about 25 residues between
this N-terminal methionine and the beginning of the region of conservation for 316 domain
sequence (i.e. LTLT…). No signal peptide sequence could be found when running the C. elegans
WormPep sequence through a signal peptide search program (PredictProtein). Thus, our choice
for the beginning of the CEle1 domain may have been off by 20 residues if the structural domain
begins at the immediate N-terminus of the complete gene product.
DNA Acquisition & Preparation: H. sapiens (HSap2)
The 316 HSap2 cDNA arrived in the pT7T3D-Pac vector (Pharmacia) contained in
DH10B host bacteria -- all of this contained in an agar stab. The bacteria were plated, grown
overnight in LB+amp media, and the high copy plasmid was isolated by the Qiagen Miniprep
method. Sequencing was performed using T3 and T7 primers. The resultant sequence contained
a coding sequence similar to the contig assembled from the EST’s. Errors in the contig sequence
were corrected by consulting sequencing waveforms for both strands of the insert. The region of
high-quality sequence of the insert was translated into a gene product that was: (1) very similar
to the 316 domain sequence, (2) contained a stop codon at the expected position, and (3) did not
contain a possible initiator methionine, but it did have at least forty residues N-terminal to the
portion of the protein (IRLT…) that begins the region of high conservation for the 316 domain
sequence.
DNA Acquisition & Preparation: M. musculus (MMus2)
Progress on the MMus2 sequence was limited. It was ordered from Genome Systems
in the same way as HSap2. However, upon receipt in the lab, the bacterial agar stab was stored at
-20°C. It should have been stored at +4°C or plated immediately. The freezing temperature
killed the bacteria in the agar. A new clone was sent for sequencing after plating, growing the
host cells overnight, amplification and purification of the plasmid containing the cDNA, and
preparing a sequence mixture (primers and template). The sequence showed this clone to
contain cDNA corresponding to a different gene; Genome Systems sent a clone that had an
identification number different in one digit than that of the correct clone. After lengthy dialogue
with Genome Systems, the company finally sent the correct clone. By this time, the project was
well into the molecular biology stage. Thus, it was decided to postpone working on this domain
sequence.
Note on sequencing
The DNA Synthesis and Sequencing Laboratory of UMDNJ-RWJMS conducted all DNA
sequencing for this project. They employ a dye-terminator PCR method of sequencing.
Sequencing readouts and waveforms were consulted as necessary. A waveform tracing is a plot
of the intensity of the fluorescence associated with the dye on each nucleotide type versus the
position of that nucleotide on the strand of DNA being sequenced.
Design of PCR primers
After the correct cDNA sequence had been established, I was in a position to design
PCR primers to clone the desired coding sequence. While designing the PCR primers it was also
assumed that the gDNA for SCer2, BSub1, and HInf1 matched that in the GenBank database.
Primers were designed such that we could extract a coding sequence that would retain the ability
to be ligated into a multiplexed expression vector system developed by Dr. Kristin Gunsalus and
colleagues (Gunsalus, K.C., et al., submitted). Gunsalus' expression system has the goal of
creating a single PCR product that may undergo different sets of digests so that it may be ligated
into nine expression plasmids (Figure 3). The plasmids differ in location of a hexa-histidine
fusion tag that is exploited for purification purposes. They also differ in type of promoter
employed during transcription of the coding sequence so that yield of recombinant protein could
be optimized. Specific restriction sites needed to be designed into the 5' and 3' coding sequence
for each domain sequence so that each PCR product would have the capability of being inserted
into each of the nine vectors.
During the course of the investigation, it was decided that 2 types of PCR product would
be made for each domain sequence -- one containing the RE2 site and one excluding the RE2
site. This was done so that the resultant domain sequences expressed as the N-terminal
hexa-histidine tag or nonfusion constructs would retain their native sequences in the C-terminal portion of
the domain sequence. Thus, one forward primer and two reverse primers were designed for each
domain. The following describes the naming scheme for PCR products: name-1 for the PCR
product coding for the C-terminal hexa-histidine fusion tag domain sequences; name-2 for the
PCR product coding for the N-terminal hexa-histidine fusion tag and the nonfusion domain
sequences. For example, BSub1-2 refers to PCR product coding for the N-terminal hexa-
histidine fusion tag and the nonfusion domain sequences of the BSub1 domain from COG
0316S.
In addition to the proper restriction sites being designed into each PCR primer:
1) Two stop codons were designed into each reverse primer to
protect against read-through by E. coli's translational machinery.
2) Primers were made long enough and with high enough GC content so that proper,
specific annealing could occur. Melting temperatures were in the range of 50 to
70°C.
3) Rare codons in the first approximately ten amino acids of the domain sequence were
“designed out” for their more prevalent synonymous codons. This was done to
prevent disruptions in the translational machinery, which may affect integrity or yield
of protein. When these changes in primer versus template sequence occurred,
approximately an additional 10 nucleotides were incorporated into the downstream
sequence to ensure proper annealing of the overall primer to the template.
4) The primer was extended to include a "CC", "GG", "CG", or "GC" -clamp at its 3'
end. This was done to ensure proper annealing at this critical area. Secondary
structure formation in this area could cause mutations (of the addition type) in the
PCR product.
5) Primer-check software at the Whitestone, Inc website was used to check the primer
sequences for secondary structure, melting temperature, and primer-dimer formation.
6) The DNA to code for the domain sequences was checked for restriction sites
commonly used in digesting products for ligation into vectors of the Gunsalus
expression kit.
In the case of the HSap2 domain sequence, only one PCR product was made because the
restriction site (RE2) for the C-terminal hexa-histidine fusion tag accommodates the protein’s
native sequence. BSub1, HInf1, and SCer2 also underwent primer design as described above.
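Several of the primer checks listed above are simple enough to sketch in Python. The code below is an illustration under stated assumptions, not the primer-check software used in this work: the melting temperature uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate for short oligonucleotides and is not necessarily the formula that software applied; the stop-codon check looks for two adjacent stop codons anywhere on the sense strand without regard to reading frame; and both example primers are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule (an assumption here)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def has_gc_clamp(primer):
    """True if the last two 3' bases are CC, GG, CG, or GC."""
    return len(primer) >= 2 and set(primer.upper()[-2:]) <= {"G", "C"}


def has_tandem_stops(reverse_primer):
    """True if the sense strand of a reverse primer carries two adjacent stop codons."""
    sense = reverse_primer.upper().translate(COMPLEMENT)[::-1]
    # Not frame-aware: the frame must still be checked against the coding sequence.
    return re.search(r"(?:TAA|TAG|TGA){2}", sense) is not None


# Hypothetical primers, for illustration only.
forward = "ATGAAAGCTCTGACCGG"
reverse = "GGATCCTCATTACAGGTC"
print(round(gc_content(forward), 2), wallace_tm(forward), has_gc_clamp(forward))
print(has_tandem_stops(reverse))
```

A primer whose estimate falls outside the 50 to 70°C range given above would be lengthened or shortened before ordering.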
PCR and PCR Product Purification/Concentration
Primers were ordered from GenoSys or the RWJMS DNA Synthesis/Sequencing
Laboratory. Two separate PCR reactions (except for HSap2) were conducted for each 316
domain member to be expressed. The 100 ul PCR mixture consisted of: 1 ul template DNA (20
ng/ul), 1 ul of each primer (10 uM concentration), 10 ul of a 10X PCR Taq buffer, 8 ul of
dNTP's, 1 ul of Taq Polymerase, and 78 ul of dH2O. PCR was conducted in the following
sequence: 10 cycles of 94°C for 1 minute, 45°C for 1 minute, and 72°C for 2 minutes; 25 cycles
of 94°C for 1 minute, 60°C for 2 minutes, and 72°C for 5 minutes; and finally, 72°C for 5
minutes. The PCR mixture underwent electrophoresis on a 0.8% agarose gel, and the
appropriate bands were excised from gel. The PCR product was purified according to the
QIAEX II Agarose Gel Extraction protocol or the QIAQuick Gel Extraction protocol (Qiagen).
Concentration of the product was determined by UV visualization on gel and comparison
against standards. If it was too low, the PelletPaint protocol (Novagen) was used to concentrate
the purified PCR product.
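The reaction mixture and cycling program given above can be checked with a little arithmetic, sketched here in Python. The volumes, stock concentrations, and cycle times are those stated in the text, with "1 ul of each primer" read as one forward and one reverse primer; the variable and function names are assumptions, and ramp times between temperatures are ignored.

```python
# Volumes in microlitres, as listed in the text.
mix_ul = {
    "template DNA (20 ng/ul)": 1,
    "forward primer (10 uM)": 1,
    "reverse primer (10 uM)": 1,
    "10X Taq buffer": 10,
    "dNTPs": 8,
    "Taq polymerase": 1,
    "dH2O": 78,
}
total_ul = sum(mix_ul.values())


def final_concentration(stock, volume_ul, total):
    """Dilution of a stock into the final reaction volume (same units as the stock)."""
    return stock * volume_ul / total


primer_final_uM = final_concentration(10, mix_ul["forward primer (10 uM)"], total_ul)
template_final_ng_per_ul = final_concentration(20, mix_ul["template DNA (20 ng/ul)"], total_ul)
buffer_final_x = final_concentration(10, mix_ul["10X Taq buffer"], total_ul)

# (number of cycles, [(temperature in C, minutes), ...]) for each stage of the program.
program = [
    (10, [(94, 1), (45, 1), (72, 2)]),
    (25, [(94, 1), (60, 2), (72, 5)]),
    (1, [(72, 5)]),
]


def program_minutes(stages):
    """Total hold time in minutes, excluding ramping between temperatures."""
    return sum(cycles * sum(minutes for _, minutes in steps) for cycles, steps in stages)


print(total_ul, primer_final_uM, template_final_ng_per_ul, buffer_final_x)
print(program_minutes(program))
```

The components sum to the stated 100 ul, each primer is diluted one hundred-fold from its stock, and the program holds for a little over four hours in total.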
Ligation, Amplification, and Purification of PCR product
To amplify the PCR product and to allow more efficient digestion of the insert using the
proper restriction enzymes, the PCR product was ligated into a high copy plasmid. A
pBluescript plasmid with overhanging T's was the first choice, and a DNA Rapid Ligation kit
and protocol was used. Considerable time (1-2 months) was spent trying to ligate the PCR product
into this vector. Insert to plasmid molar ratio was increased in reactions from 3:1 to 6:1 to 9:1
and finally to 12:1. New PCR product was synthesized. Fresh Taq Precision Polymerase Plus
enzyme and ligation mixture were used. Newly isolated pBluescript phagemid was also used.
However, among four lab members, none were able to generate a successful reaction. Controlled
tests could not determine what was the cause of failure. Consequently, a kit from Invitrogen that
allows ligation of PCR products into a stable, high copy plasmid (pCRII-TOPO) and
transformation of that plasmid into TOP10 cells was used. Instructions are found in the
Invitrogen manual. By use of colony PCR and/or restriction digest analysis, it was confirmed
that all PCR products (2 for each of the 5 organisms except H. sapiens) were ligated into the
pCRII-TOPO plasmid.
Each of the cloned inserts was sequenced to guard against errors in eventual expressed
protein. M13 and T7 primers were used because they border the multiple cloning site of the
pCRII-TOPO holding/amplification vector. The sequencing results showed point mutations in
several of the inserts in the portion coded by the primers and no errors present elsewhere in the
inserts. Table 2 shows examples of errors that were found through sequencing of both strands of
the ligated PCR product. There seems to be no pattern in the errors that would support a
sound explanation. It was hypothesized that a chemical modification (e.g. deamination) of a
purine would tend to make it appear to the DNA polymerase more like the other purine (with a
similar explanation given for pyrimidine mutations). Consistent results supporting this hypothesis
were not evident. In some cases, when the same primer was used in two different PCR reactions,
different errors would occur in each reaction's PCR product. Error-containing PCR
products arose both in reactions that used primers from GenoSys and in reactions that
used primers from the RWJMS DNA Synthesis Lab. After redoing some of the
reactions and observing that HSap2, CEle1-1, CEle1-2, and BSub1-1 were the only PCR
products of acceptable sequence, it was decided to focus on creation of expression vectors
corresponding to these PCR products.
The PCR products in the holding/amplification vector were allowed to amplify in E. coli
overnight at 37°C in 100 ml of LB culture. The Qiagen Midiprep protocol was used to isolate
and purify plasmid at the hundreds of ng/ul level. Concentrations were estimated by applying the
relationship that 1 absorbance unit at 260 nm corresponds to 50 ug/ml of double-stranded DNA.
Concentrations were also verified by running diluted plasmid samples on agarose gels and
comparing fluorescence of ethidium bromide-stained DNA to samples of known concentration.
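The A260 estimate can be written out as a one-line conversion. The following is a minimal sketch that assumes the standard factor for double-stranded DNA (50 ug/ml per absorbance unit at a 1 cm path length). The absorbance reading and dilution factor in the example are hypothetical, since the actual readings are not reported here.

```python
DSDNA_UG_PER_ML_PER_A260 = 50.0  # standard factor for double-stranded DNA


def dna_conc_ng_per_ul(a260, dilution_factor=1.0):
    """Concentration of the undiluted stock.

    ug/ml is numerically equal to ng/ul, so no further unit conversion is needed.
    """
    return a260 * DSDNA_UG_PER_ML_PER_A260 * dilution_factor


# Hypothetical reading: A260 = 0.12 measured on a 1:50 dilution of the midiprep.
stock_ng_per_ul = dna_conc_ng_per_ul(0.12, dilution_factor=50)
print(f"{stock_ng_per_ul:.0f} ng/ul")
```

A reading of this size on a 1:50 dilution would correspond to a stock in the hundreds of ng/ul, which is the range quoted above.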
Creation of expression vectors
Plasmids containing inserts were digested using the appropriate restriction enzymes so
that the inserts could be ligated into the selected expression vectors. 60 ul restriction digest reactions
were set up as follows:
  4-6 ug of DNA                  x ul
  Restriction enzyme 1           3 ul
  Restriction enzyme 2 or 3      3 ul
  Enzyme buffer + 10X BSA        6 ul
  dH2O                           60 - x ul
  Total reaction mixture         60 ul
Reactions were conducted at 37°C for approximately 5 hours. Reaction mixture was run on
0.8% agarose gels, the proper band was excised, and newly digested insert was purified using the
QIAQuick Gel Extraction kit and protocol. The concentration for purified inserts was tested, and
inserts were concentrated by using the methods described above in "PCR and PCR Product
Purification/Concentration." Expression vectors from the Gunsalus expression kit were also
digested, purified, and quantified for concentration using the same methods as when preparing
the inserts.
Ligation reactions were conducted using the DNA Rapid Ligation kit and protocol
introduced in the "Ligation, Amplification, and Purification of PCR product" section above.
Specifically, a 21 ul reaction mixture was made consisting of:
  DNA (100 ng) **                x ul
  DNA dilution buffer (5x)       10 - x ul
  Ligation buffer (2x)           10 ul
  Ligase                         1 ul
  Total reaction mixture         21 ul
** DNA consists of vector and insert at 3:1 or greater insert:vector molar ratio.
Reaction was run at room temperature for thirty minutes to four hours depending on the success
of previous reactions.
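Because the ligation recipe fixes the total DNA at 100 ng while the ratio is specified in moles, the mass of each component depends on the relative lengths of insert and vector. The sketch below shows one way to split the total. The insert and vector lengths are hypothetical placeholders (the actual construct sizes are not given in this section), and the function name is illustrative.

```python
def split_ligation_dna(total_ng, insert_bp, vector_bp, molar_ratio=3.0):
    """Return (insert_ng, vector_ng) for a given insert:vector molar ratio.

    Moles are proportional to mass / length, so the insert:vector mass ratio
    is molar_ratio * insert_bp / vector_bp.
    """
    mass_ratio = molar_ratio * insert_bp / vector_bp
    vector_ng = total_ng / (1.0 + mass_ratio)
    insert_ng = total_ng - vector_ng
    return insert_ng, vector_ng


# Hypothetical sizes: a 300 bp insert and a 4,500 bp expression vector.
insert_ng, vector_ng = split_ligation_dna(100.0, insert_bp=300, vector_bp=4500)
print(f"insert: {insert_ng:.1f} ng, vector: {vector_ng:.1f} ng")
```

Raising `molar_ratio`, as was done when a ligation failed, shifts more of the fixed 100 ng toward the insert.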
After the ligation reaction was assumed to have completed, 1 ul of the reaction mixture
was pipetted and gently stirred into a tube of 10 ul of NovaBlue competent cells (Novagen).
Transformation was carried out according to the Novagen transformation protocol. If colonies
were present on plates after the overnight incubation of transformed cells, at least 4 colonies
were chosen to undergo analysis for completed expression vector. Theoretically, cells that grew
on the plates (with the appropriate antibiotic resistance) should contain the insert unless plasmid
was already in circular form.
Cells were lysed and underwent colony PCR to determine if the insert was contained in
the plasmid. If results were inconclusive, plasmid was isolated and examined by restriction
digest analysis to determine if insert was present. If necessary, the plasmid presumed to have
the insert was also run on gel and compared in size against a control plasmid of the same type
but without insert and undigested. For creation of some expression vectors, success was
immediate (e.g. ligation after the first reaction). For reactions in which no colonies were
produced after transformation or for reactions in which cells did not contain completed
expression vector, the ligation and transformation procedure was repeated. DNA amounts were
scaled up, insert:vector molar ratio was increased, and incubation time for ligation was increased.
Ultimately, all but one of the expression vectors were created for PCR products of acceptable
sequence (Table 3).
Biochemistry
The following describes all general and some specific methods used in protein
expression, purification, and analysis. These methods were applied to domain sequences from
the COG 0316S and 0229S families.
SDS-PAGE
17.5% SDS-PAGE gels were poured according to the following recipe. These 10- or 15-lane
gels were used as necessary for all experiments described in this paper. The following
recipe provides for 12 gels:
                              Bottom gel (17.5%)    Stacking gel (4%)
  40% acrylamide mix          21.88 ml              2.0 ml
  1.5 M Tris-HCl (pH 8.8)     12.50 ml              0 ml
  1.0 M Tris-HCl (pH 6.8)     0 ml                  2.5 ml
  dH2O                        14.6 ml               15.1 ml
  10% SDS                     0.5 ml                200 ul
  10% APS                     0.5 ml                200 ul
  TEMED                       20 ul                 15 ul
  Total                       50 ml                 20 ml
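The nominal gel percentages follow directly from the volume of 40% acrylamide stock in each mixture. A small sketch of that check is given below, using only the volumes from the recipe; the function name is illustrative.

```python
def final_percent(stock_percent, stock_ml, total_ml):
    """Final concentration (%) after diluting a stock into the total volume."""
    return stock_percent * stock_ml / total_ml


# Volumes in ml, in the order listed in the recipe (ul entries converted to ml):
# acrylamide, Tris pH 8.8, Tris pH 6.8, dH2O, SDS, APS, TEMED.
bottom_gel_ml = [21.88, 12.50, 0.0, 14.6, 0.5, 0.5, 0.020]
stacking_gel_ml = [2.0, 0.0, 2.5, 15.1, 0.2, 0.2, 0.015]

for name, volumes, nominal_ml in [("bottom", bottom_gel_ml, 50.0),
                                  ("stacking", stacking_gel_ml, 20.0)]:
    acrylamide = final_percent(40.0, volumes[0], nominal_ml)
    sds = final_percent(10.0, volumes[4], nominal_ml)
    print(f"{name}: {sum(volumes):.3f} ml, "
          f"{acrylamide:.2f}% acrylamide, {sds:.2f}% SDS")
```

Both mixtures come out at the stated acrylamide percentages (17.5% and 4%) and at 0.1% SDS.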
Protein samples were mixed with a 2x SDS-loading dye and B-ME mixture. This
mixture was boiled for 5 minutes. Samples were run at 150 V in an SDS running buffer. A
BIORAD "Prestained SDS-PAGE Standards Broad Range" ladder was always run in one lane of
each gel to estimate the molecular weight of bands. Often, lysozyme standards of known
concentration were run on gels to estimate the quantity of protein in each band. After
electrophoresis, gels were washed twice for 5 minutes in 50 ml of fresh dH2O. Gels were
then stained for 10 or more hours using 20 ml of Coomassie Blue G-250 (Pierce). After the
staining period, gels were destained in 50 ml of dH2O for six or more hours to improve
resolution, contrast, and sensitivity. Gels were scanned by using the Montelione lab scanner and
Adobe PhotoShop. From March 2000 onwards, gels were scanned by using the CABM 2nd floor
gel photo-imaging equipment, which gave much better image quality than the scanner.
Testing total protein expression (small-scale) via boil/chill cell lysis
To test for the total amount of target protein expressed in E. coli in various conditions,
cell lysate was analyzed on SDS-PAGE gels. Generally, the procedure involved freshly
transforming the expression vector into the cell line (Novagen protocol). Cells were plated and
incubated overnight at 37°C. Multiple colonies were picked from plates, and each was grown in
4 ml of LB or minimal media (supplemented with 2% glucose if the vector contained a T7 promoter) with
appropriate antibiotic. This overnight bacterial culture was used to inoculate 4 ml of fresh
media, which was incubated at 37°C, shaken, and monitored for OD600 by using the
spectrophotometers of the Montelione or Anderson labs. When OD600 of 0.5 to 0.8 was reached,
1 or 2 mM IPTG was added to induce expression. Cells were allowed to incubate at 37°C for 3
hrs. if in LB (5 hours if in minimal media) or at 27°C for 8 hrs. if in minimal media.
The recipe for the minimal media used in these experiments consists of:
  Ammonium sulfate                    2.5 g
  potassium phosphate (monobasic)     9.0 g
  potassium phosphate (dibasic)       6.0 g
  sodium citrate                      0.5 g
  dH2O                                970 ml
  magnesium sulfate (0.2 mg/ml)       5 ml
  20% glucose                         25 ml
  Gibco Trace elements stock (10x)    1 ml
  Ampicillin (50 mg/ml)               2 ml
  Total                               1 liter
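The recipe can be read as final concentrations and rescaled to other culture volumes (for example, the 100 ml and 900 ml volumes used in the large-scale protocols). The sketch below treats the total as a nominal 1000 ml, which is an approximation because the listed liquids sum to slightly more than that. Function names are illustrative.

```python
def final_conc(stock_conc, stock_ml, total_ml=1000.0):
    """Final concentration of a liquid stock in a nominal 1 liter of media."""
    return stock_conc * stock_ml / total_ml


def scale_solids(target_liters):
    """Scale the dry components of the 1 liter recipe to another volume (grams)."""
    solids_g_per_liter = {
        "ammonium sulfate": 2.5,
        "potassium phosphate (monobasic)": 9.0,
        "potassium phosphate (dibasic)": 6.0,
        "sodium citrate": 0.5,
    }
    return {name: grams * target_liters
            for name, grams in solids_g_per_liter.items()}


glucose_percent = final_conc(20.0, 25.0)                # % glucose in the media
ampicillin_ug_per_ml = final_conc(50.0, 2.0) * 1000.0   # mg/ml converted to ug/ml

print(f"glucose: {glucose_percent:.2f}%")
print(f"ampicillin: {ampicillin_ug_per_ml:.0f} ug/ml")
print(scale_solids(0.9))
```

Under the nominal-volume assumption the media contain 0.5% glucose and 100 ug/ml ampicillin.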
Cells were then spun down and the supernatant was decanted and pipetted away. For the
small-scale whole cell lysate evaluation, 500 ul of cells were centrifuged at ~15,000 x g.
Pelleted cells were stored at -20°C if necessary. To continue protocol for small-scale whole cell
lysate evaluation, cells were thawed on ice and subsequently resuspended and vortexed in SDS
loading dye and B-ME. 50 ul of dye/B-ME was used for each absorbance unit.
Cells were lysed by a 100°C boil for 5 minutes and chilled on ice for one minute. Cell debris
was spun down at ~15,000 x g for 5 minutes. 10 ul of supernatant was loaded onto SDS-PAGE
gels. Amount of protein was quantified by comparison against lysozyme standards.
Testing solubility of protein (small-scale) via B-PER
Testing target proteins for solubility refers to determining whether the recombinant
protein is found in the cytosol or inclusion bodies of the bacteria in which it is expressed.
B-PER (Pierce) is a detergent reagent that supposedly lyses cells and allows one to differentiate
between soluble and insoluble proteins. The procedure is similar to the above "Testing total
protein expression (small-scale) via boil/chill cell lysis" procedure until cells have been
centrifuged and frozen (if necessary). Then, cells from 1.5 ml of culture were resuspended in
300 ul of B-PER and vortexed until the mixture was homogeneous. The mixture was centrifuged at
13,000 rpm for 5 minutes to separate insoluble (pellet) and soluble (supernatant) proteins. The
inclusion bodies (pellet) may be resuspended in 1 ml of 1:10 B-PER. Samples were analyzed via
SDS-PAGE.
Small-scale testing for solubility -- cell lysis via sonication
Sonication is another method by which cells may be lysed, and these results seem to be
more indicative of solubility than the B-PER tests. However, to sonicate by directly placing the
sonicator tip in cell culture, culture should be at least 1 ml in volume (for sonicator microtip of
Stock lab). Thus the following procedure was used. Cells were grown as indicated in "Testing
total protein expression (small-scale) via boil/chill cell lysis" section. The 4-ml overnight culture
was added in a 1:10 dilution to fresh 23 ml of media in a 250-ml flask. After inducing with
IPTG and incubating cells as described above, cells were spun down. The pellet was
resuspended in 1 ml of a native lysis buffer (50 mM NaPi, 300 mM NaCl, 10 mM imidazole pH
8.0). The Stock sonicator with microtip was used to lyse the cells (setting 4, 2 minutes on, 20
seconds on / 20 seconds off cycles) in a 2.0 ml eppendorf tube that sits in an ice-water mixture
mixed with a stir bar. The cell lysate was spun down (insoluble phase), and 10 ul of the
supernatant (soluble phase) was analyzed by SDS-PAGE.
Large-scale protein expression and Ni+2-affinity chromatography purification -- denaturing
conditions (COG 316 N-term H6-tagged)
When it was observed that the 316 HSap2-14d-NtermH6 protein construct was
expressed in high yield and found in inclusion bodies, a scheme was devised to purify this
protein in denaturing conditions from one liter of 15N-enriched culture. The following protocol
was developed for protein expression and inclusion body isolation.
1) 4 ml of transformed BL21(DE3)pLysS cells were grown overnight in LB+amp. This was
added to 100 ml of 15N-enriched minimal media + amp and grown overnight at 37°C.
2) The 100 ml of culture was added to 900 ml of 15N-enriched minimal media + amp. This was
incubated at 37°C until OD600 reached 0.692. Expression was then induced with 2 mM IPTG
and allowed to incubate at 37°C for 5 hours (OD600 = 1.6).
3) Cells were centrifuged at 5000 x g for 30 minutes. Supernatant was discarded and cell pellet
was stored at -20°C. Protein from half of the cells (i.e. 450 ml of culture) was purified in the
following steps.
4) Cells were resuspended in 45 ml of native lysis buffer* and sonicated (setting 4, 10 minutes
on, 30 seconds on/off intervals).
5) Lysate was centrifuged at 10,000 x g, 30 minutes, and at 4°C. Supernatant was decanted.
6) Pellet was resuspended in 45 ml of 6M GuHCl pH 8.0 and sonicated as described in step 4.
7) The solution was centrifuged at 15,000 x g, 30 minutes, and at 4°C. Supernatant (solubilized
inclusion-body protein) was saved and stored at 4°C.
8) Solutions and intermediates were analyzed via SDS-PAGE.
* Native lysis buffer is composed of 50 mM NaPi, 300 mM NaCl, 10 mM imidazole pH 8.0.
To exploit the hexa-histidine fusion tag at the N-terminus of the protein, the following
protocol was used to purify using Ni+2-affinity chromatography (from Rong Xiao).
1) 1 ml of Ni+2-NTA beads was added to the solution from step 7 presumed to contain the
protein isolated from inclusion bodies. Solution was incubated overnight at 4°C on a Lazy Susan.
2) A 20 ml BIORAD column was washed 2x with dH2O and equilibrated with one volume of
6M GuHCl pH 8.0.
3) Protein and Ni+2-NTA bead solution was added to the column. It was washed 3x with 6M
GuHCl pH 8.0.
4) Protein was washed with 5 ml of 6M GuHCl, 100 mM NaPi, 10 mM Tris-HCl pH 8.0
5) Protein was washed with 5 ml of 6M GuHCl, pH 6.3, 0.1 M Na2HPO4, 0.1M NaH2PO4.
6) Protein was washed with 5 ml of 6M GuHCl, pH 5.9, 0.1 M Na2HPO4, 0.1M NaH2PO4.
7) Protein was eluted with 10 ml of 6M GuHCl, 50 mM NaAc pH 4.2. Fractions were collected
in 1.5 ml eppendorf tubes.
8) Intermediates and elution fractions were analyzed via SDS-PAGE.
Large-scale protein expression and Ni+2-affinity chromatography purification -- native
conditions (COG 316 HSap2 C-termH6-tagged & COG 0229S Cele N-term H6-tagged proteins)
When it was observed that the 316 HSap2-23d-CtermH6 protein construct was
expressed in high yield and soluble, a scheme was devised to purify this protein in native
conditions from one liter of 15N-enriched culture. The following protocol was used for protein
expression in one liter of culture and Ni+2-affinity chromatography purification of protein (Rong
Xiao).
1) BL21(DE3)pLysS cells were transformed with the HSap2-23d-CtermH6 expression vector. A
colony was picked and incubated at 37°C overnight in 4 ml of LB+amp.
2) 4 ml of overnight culture was added to 100 ml of 15N-enriched minimal media + amp and
grown overnight at 37°C.
3) The 100 ml of culture was added to 900 ml of 15N-enriched minimal media + amp. This was
incubated at 37°C until OD600 reached 0.819. Expression was then induced with 1 mM IPTG
and allowed to incubate at 27°C for 8 hours (OD600 = 1.676).
4) Cells were centrifuged at 5000 x g for 30 minutes. Supernatant was discarded and cell pellet
was stored at -20°C.
5) Cells were resuspended in 50 ml of native lysis buffer and sonicated (setting 4, 10 minutes
on, 30 seconds on/off intervals).
6) Lysate was centrifuged at 15,000 x g, 60 minutes, and at 4°C. Supernatant was collected and
membrane filtered (0.45 um Corning).
7) 2.5 ml of Ni+2-NTA beads were added to a 20 ml BIORAD column. A supplied disc was
placed above the beads to keep them level and retain moisture. Column was washed twice
with water and twice with native lysis buffer. Protein solution was loaded onto column.
8) The protein was washed with native lysis buffer until OD280 of the final 1 ml was less than
0.05 (30 ml total).
9) The protein was washed with wash buffer until OD280 of the final 1 ml was less than 0.01 (20
ml).
10) The protein was eluted with 16 ml of elution buffer and collected as fractions.
11) The initial flowthru from this column was run through several fresh Ni+2-NTA columns
because several additional milligrams of recombinant protein could be acquired this way.
12) Intermediates and elution fractions were analyzed via SDS-PAGE.
Native lysis buffer is 50 mM NaPi, 300 mM NaCl, 10 mM imidazole pH 8.0.
Wash buffer is 50 mM NaPi, 300 mM NaCl, 20 mM imidazole pH 8.0.
Elution buffer is 50 mM NaPi, 300 mM NaCl, 250 mM imidazole pH 8.0
This same protocol -- except for purification of column flowthru -- was also used to purify the
229 protein because it was presumed to be in the cytosolic phase.
Buffer exchange
To exchange sample buffer, one of three dialysis options was used. In addition, use of a
BIO-RAD desalting microcolumn allowed for rapid desalting. Gel filtration, which is described
in protein purification methods, was also used as a buffer exchange method because it dilutes the
sample buffer.
Dialysis
Testing for folding (denaturing to native conditions) – microdialysis buttons
Folding is assumed to occur during a slow exchange of buffer from denaturing to native
conditions. Small-scale dialysis was used to test the refolding of 316 HSap2-14d-NtermH6 tag
protein. A protein sample in 6M GuHCl, 5 mM DTT pH ~7.5 was first heated at 52°C for 10
minutes to disrupt any bonding. Clean microdialysis buttons were loaded with 5 ul of protein. A
wooden-peg “plunger” was used to seal a pre-soaked dialysis membrane (Spectrum Spectra/Por,
Molecular Weight Cutoff 3,500) upon the opening of the button. A black, plastic O-ring was
used to complete the seal. The procedure was repeated if air bubbles were observed upon sealing
the membrane. Care was taken not to touch the dialysis membrane. The button
containing protein sample was immersed in wells of several 3 ml buffers varying in pH, salt
concentration, and type of buffering reagent used. Dialysis was conducted at 4°C for several
hours. After 21 hours, buttons were visualized under a light microscope (Stock lab darkroom),
and observations about precipitate were recorded. Buttons were left at 4°C for 2 more days, and
precipitate conditions did not change. Buttons were then left at room temperature and observed a
few weeks later. Conditions again did not change.
Small scale dialysis – cassettes
When protein samples in the range of tens to hundreds of microliters needed to be
dialyzed, cassettes (Slide-A-Lyzer by Pierce) were used. The cassettes allow for a small amount
of protein sample to be loaded via a syringe and 18-gauge needle through very small ports in the
cassette. The sample resides in a membrane pouch at the center of the cartridge.
The protocol for loading and unloading the sample from the cassette is found in the
Pierce Slide-A-Lyzer instruction manual. Care was taken not to puncture the
membrane with the needle. The loaded cassette was suspended in a beaker with dialysis buffer at
a volume at least 1000x greater than the sample volume. Sample was stirred (setting 2-4) and left at
4°C. If time permitted, dialysis buffer was replaced with new buffer at some point in the middle
of dialysis.
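The reason for the 1000x buffer volume, and for replacing the buffer partway through, can be seen from a simple equilibrium estimate. The sketch below assumes that small solutes equilibrate fully between sample and buffer in each round, which is an idealization. The sample and buffer volumes are hypothetical examples that satisfy the 1000x rule.

```python
def residual_fraction(sample_ml, buffer_ml, rounds=1):
    """Fraction of the original small-solute concentration left in the sample.

    Assumes each round of dialysis reaches full equilibrium against fresh buffer.
    """
    per_round = sample_ml / (sample_ml + buffer_ml)
    return per_round ** rounds


# Hypothetical example: a 0.5 ml sample dialyzed against 500 ml of buffer (1000x).
one_round = residual_fraction(0.5, 500.0)
two_rounds = residual_fraction(0.5, 500.0, rounds=2)
print(f"{one_round:.2e} of the original buffer components remain after one round")
print(f"{two_rounds:.2e} remain after one buffer replacement")
```

One equilibrated round reduces the original buffer components about a thousandfold; a single buffer replacement brings this to roughly a millionfold.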
Large-scale dialysis
When protein samples in the ones to tens of milliliters needed to be dialyzed, dialysis
membrane tubing was used. Membrane was soaked in dH2O for 5 minutes, and sample was
poured or pipetted into the tubing. Orange dialysis tubing clips were used to seal off the tubing
at both ends, and the sample was subjected to dialysis as described above in the “small-scale dialysis –
cassettes” section.
Buffer-exchange using BIO-RAD Desalting Micro-columns
BIO-RAD P-6 "Micro Bio-Spin" Chromatography Columns are the size of eppendorf
tubes. These columns retain molecules smaller than 6 kDa by use of a polyacrylamide gel
matrix. They are pre-packed with a Tris buffer, and a new buffer can replace this buffer with 4
cycles of wash and equilibration with that new buffer. These columns were used to buffer
exchange protein samples 20 to 75 ul in volume. See BIO-RAD instruction manual for further
details.
Concentrating protein samples
"Centri"fugal devices
The Centriplus (original volume 2 to 10 ml) and Centricon (original volume less than 2
ml) from Millipore were used to concentrate 316 and 229 protein samples. Instructions can be
found in manuals. There is no evidence of the 316 or 229 protein samples irreversibly binding to
the YM membranes of the concentrating devices.
Lyophilization (for 316 C-termH6 tag protein)
The 316 C-termH6 tag protein was subjected to lyophilization. Lyophilization was tried
twice, and both times it appeared from SDS-PAGE analysis that a significant fraction of the
protein was lost in the procedure, perhaps as great as a 2/3 loss. It was decided that this was not
a good way to concentrate the sample.
Protein purification: FPLC
Gel Filtration
Gel filtration was conducted on the 316 C-termH6 tag protein. A Pharmacia High Load
16x60 Biotech Superdex 75 prep grade column was used. Its useful Mr fractionation range for
proteins is reported as 3,000 to 70,000. It was attached to the FPLC system of the
Montelione lab. Two elution buffer conditions were tested: 200 mM NaCl, 25 mM NaPi, 2 mM
DTT pH 6.8 and 300 mM NaCl, 50 mM NaPi, 2 mM DTT pH 8.0. A flow rate of 2.5 ml/min was used.
SDS-PAGE analysis was performed on elution fractions.
Anion Exchange Chromatography: 316 C-termH6 tag protein
A Pharmacia Mono Q column was used to purify the 316 C-termH6 tag protein. The
column makes use of a quaternary ammonium strong anion exchanger. Protein was eluted from
column using a linear gradient from 0 to 1 M NaCl in 20 mM Tris-HCl, 5 mM DTT pH 7.5. A
major peak was observed at 58% into the linear gradient. SDS-PAGE analysis was performed on
elution fractions.
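The position of a peak along a linear gradient can be converted into an approximate salt concentration. The sketch below assumes an ideal linear gradient and ignores any delay volume between the mixer and the detector, so the result is only an estimate. The function name is illustrative.

```python
def salt_at_gradient_fraction(fraction, start_M=0.0, end_M=1.0):
    """Salt concentration at a given fraction (0 to 1) of a linear gradient."""
    return start_M + fraction * (end_M - start_M)


# Major peak observed at 58% of the 0 to 1 M NaCl gradient.
peak_nacl_M = salt_at_gradient_fraction(0.58)
print(f"approximately {peak_nacl_M:.2f} M NaCl at the major peak")
```

On this estimate, the major peak eluted at roughly 0.58 M NaCl.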
Cation Exchange Chromatography: 229 protein
A Pharmacia Mono S column was used to attempt to purify the 229 protein. The column
makes use of a methyl sulfonate strong cation exchanger. Elution from the column was attempted
using a linear gradient up to 1 M NaCl in 25 mM MES, 10 mM DTT pH 5.5. SDS-PAGE analysis
was performed on elution fractions.
Biophysical analysis
Mass Spectrometry (mass spec)
Mass spec was performed on protein samples at the CABM mass spectrometer by Rong
Xiao. 1 ul of sample was mixed with a matrix and spotted onto a plate that is inserted into the
mass spec machine.
Far-UV circular dichroism (CD) spectroscopy: 229 protein
CD was performed on the 229 protein by Dr. Norma Greenfield of the RWJMS CD lab.
2D-NMR -- HSQC and TROSY: 229 protein
HSQC and TROSY experiments on the Montelione lab 500 MHz spectrometer were run
upon the 229 protein by Dr. Parag Sahasrabudhe. Spectra were referenced.
Results & Discussion
Note: Most methods referred to in this section have been described,
sometimes in great detail, in the Biochemistry portion of Experimental Methods. Please consult
that section if you have questions about methods.
Data and reports regarding the computational biology work may be found at our COG Analysis
and Expansion Website:
www-nmr.cabm.rutgers.edu/~bioinformatics/cogs
COG 0011S (computational biology)
An expansion of the COG resulted in the growth of the domain family from 3 sequences
to 9 sequences. The area of conservation spans about 97 residues, which agrees with the lengths of
many of the full-length proteins. This allows one to assume that all of the proteins in the domain
family are single domain proteins. Since no Metazoan proteins were found to contain this
domain, work was discontinued after establishing the expanded domain family.
COG 0229S: C. elegans gene product (CEle1) – N-term hexa-histidine construct
Alexandra Gardino of the lab had already done the computational biology and molecular
biology work necessary for protein expression. This expanded domain family is conserved in at
least 15 gene products. It has representation among eukaryotic and prokaryotic organisms. The
region of conservation spans about 150 residues, and most of the full-length proteins in this
family appear to be single domain sequences.
Charles Lu of the lab had transformed expression vectors into various E. coli cell lines
and tested for expression and solubility. It was decided that the C. elegans gene product of the
domain family would be expressed for each of the three hexa-histidine tag loci options. Lu had
also shown through small-scale testing that significant amounts of the protein were found in the
cytosolic phase of the cell lysate. Lu previously purified the C-terminal hexa-histidine tag
protein via Ni+2-affinity chromatography and FPLC. His CD data showed this protein to be
about 50% random coil, and his HSQC data was hard to interpret either because the pH of the
sample was too high or the sample was very disordered.
Cell lysis and Ni+2-affinity chromatography
I resuspended a cell pellet that Charles had centrifuged from one liter of E. coli culture
induced to express the 15N-enriched N-terminal hexa-histidine tag protein. This pellet had been
stored at -20°C for a few months. The cells were lysed and purified as described in the "large
scale protein expression and Ni+2-affinity chromatography -- native conditions" section of
Experimental Methods. Purification intermediates were analyzed via SDS-PAGE and compared
to lysozyme standards on the same gels. The following yield of recombinant protein was
projected from 1 liter of culture (Figure 5a):
  Total yield:                              65 mg
  Soluble yield:                            21 mg
  Protein eluted from Ni+2-NTA column:      7.8 mg (in 12 ml)
After this eluted protein was stored at 4°C for 7 days, it was noticed that precipitate had formed.
DTT was added to this sample to 10 mM.
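The projected yields can be expressed as step-by-step recoveries, which makes the losses at each stage easier to compare. The short sketch below uses only the figures from the table above; the variable names are illustrative.

```python
total_mg = 65.0           # total recombinant protein projected per liter
soluble_mg = 21.0         # portion found in the soluble phase
eluted_mg = 7.8           # protein eluted from the Ni+2-NTA column
elution_volume_ml = 12.0

soluble_fraction = soluble_mg / total_mg      # share of expressed protein that is soluble
column_recovery = eluted_mg / soluble_mg      # share of soluble protein recovered by the column
overall_recovery = eluted_mg / total_mg       # share of expressed protein recovered overall
eluate_mg_per_ml = eluted_mg / elution_volume_ml

print(f"soluble fraction:  {soluble_fraction:.0%}")
print(f"column recovery:   {column_recovery:.0%}")
print(f"overall recovery:  {overall_recovery:.0%}")
print(f"eluate:            {eluate_mg_per_ml:.2f} mg/ml")
```

About a third of the expressed protein was soluble, a little over a third of that was recovered from the column, and the eluate was at 0.65 mg/ml.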
Dialysis
It was necessary to decrease the sample pH so that cation exchange chromatography
could be used to purify the sample. The protein was originally in a pH 8.0 buffer, and its pI is
7.70. 500 ul of the sample was buffer exchanged (Pierce cassette) for a 25 mM MES, 10 mM
DTT pH 5.5 buffer. After dialyzing for 24 hrs. at 4°C, some but very little precipitate was
visible. The rest of the sample was dialyzed (dialysis tubing) into the same buffer as the
previous step. After 10 hrs. of dialysis, much precipitate was present in the dialysis tubing. The
sample's pH was 6.22; crossing over the pI probably caused this precipitation. This sample was
analyzed via SDS-PAGE, and yield was estimated to be 3 mg. The old dialysis buffer was
exchanged for fresh buffer, and dialysis was allowed to occur for 14 hrs. more. No additional
precipitate was evident, which gives further reason to believe that the sample nearing pH of 7.7
caused precipitation. 9.5 ml of sample, containing about 3 mg of protein, was recovered at pH 5.45. Absorbance of the
sample at 280 nm was measured; using the relationship that 1 A(280) = 0.84 mg/ml, the yield of
this sample was calculated to be 3.6 mg. This is close in value to the 3 mg estimated from
SDS-PAGE analysis.
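The A(280) calculation above is a product of three numbers: the absorbance, the conversion factor, and the sample volume. The sketch below writes this out. Because the absorbance reading itself is not reported, the code back-calculates the reading implied by the stated 3.6 mg yield; that implied value is a derived illustration, not a measured datum.

```python
MG_PER_ML_PER_A280 = 0.84  # conversion factor quoted for the 229 protein


def yield_mg(a280, volume_ml, factor=MG_PER_ML_PER_A280):
    """Protein yield from an A(280) reading and the sample volume."""
    return a280 * factor * volume_ml


def implied_a280(yield_mg_value, volume_ml, factor=MG_PER_ML_PER_A280):
    """Absorbance reading that would correspond to a given yield."""
    return yield_mg_value / (factor * volume_ml)


a280 = implied_a280(3.6, 9.5)
print(f"implied A(280): {a280:.2f}")
print(f"yield check:    {yield_mg(a280, 9.5):.1f} mg in 9.5 ml")
```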
SDS-PAGE of the sample after the dialysis step for ion exchange chromatography shows
a decrease in high molecular weight proteins and an increase in low molecular weight proteins
after dialysis (Figure 5). Perhaps some degradation occurred during the 40 days in which I was
working on this sample. Mass spec taken on the 40th day after the Ni+2-affinity purification
shows that the monomer species is predominant and that there are some lower molecular weight
species (Figure 6). This mass spec agrees with the previous SDS-PAGE analysis.
Cation exchange chromatography
Two cation exchange chromatography experiments, the first with 2 ml and the second
with 0.5 ml of the sample loaded, were conducted using the MonoS column and FPLC system.
A 1M NaCl, 25 mM MES, 10 mM DTT pH 5.5 linear gradient was used to try to elute the
protein. The cation exchange chromatograms showed a very large peak before the salt gradient had even
increased from 0%. The A(280) values for these peaks were extraordinarily high; the first run had a peak
greater than A(280) of 1.5 and the second run had a peak of 0.45. When these and other elution
fractions were visualized via SDS-PAGE, no protein was evident. The four elution fractions
corresponding to the peak of the second experiment were pooled together, concentrated
(Centricon) from 3.5 ml to 0.60 ml, and analyzed via SDS-PAGE. No protein was evident from
these fractions.
To try to determine the status of the protein during ion exchange experiments, a third
experiment was conducted in which 1 ml of the sample was loaded onto the column. A buffer
matching that of the sample was run through the column to try to flush the protein off the
column; no protein was evident via chromatogram, SDS-PAGE, or mass spec. We deduced that
the protein was bound to the column. A 1M NaCl, 25 mM MES, 10 mM DTT pH 5.5 buffer was
applied at 100% to the column to try to elute the protein. No protein was visible (confirmed by mass spec and
SDS-PAGE), and the large peak seen in the first 2 experiments was not visible. Next,
a 2 M NaCl pH 6.34 solution (100%) was used to try to elute the protein, and nothing again was
evident. Finally, a 1M NaCl, 50 mM HEPES pH 8.0 solution was used to try to elute the protein
from the column. This solution has a pH greater than the pI; thus, the protein overall and the
column should now both be negatively charged and should repel each other. Again, no eluted protein
was evident (chromatogram and SDS-PAGE). The most logical explanation for why the protein
does not elute from the column is that it has irreversibly bound to it. Thus, purification of this
protein on a Mono S column is not an efficient step.
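The reasoning behind the choice of elution conditions rests on a simple rule of thumb: a protein carries a net positive charge below its pI and a net negative charge above it. The sketch below applies that rule to the buffer pH values tried above, using the pI of 7.70 given earlier. It is a sign rule only; it does not estimate the magnitude of the charge or account for local charge patches, and the function names are illustrative.

```python
PROTEIN_PI = 7.70  # pI of the 229 CEle N-term H6 protein, as stated in the text


def net_charge_sign(ph, pi=PROTEIN_PI):
    """Sign of the expected net charge: '+' below the pI, '-' above it."""
    if ph < pi:
        return "+"
    if ph > pi:
        return "-"
    return "0"


def expected_to_bind_cation_exchanger(ph, pi=PROTEIN_PI):
    """A cation exchanger (negatively charged resin) binds net-positive protein."""
    return net_charge_sign(ph, pi) == "+"


for ph in (5.5, 6.34, 8.0):
    print(ph, net_charge_sign(ph), expected_to_bind_cation_exchanger(ph))
```

By this rule the protein should bind at pH 5.5 and 6.34 and be released at pH 8.0, which is why the failure to elute at pH 8.0 points to irreversible binding.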
Biophysical Characterization of 229 CEle-NtermH6 protein
Because only a small amount of the sample remained after our cation exchange
chromatography trials, we thought the best course of action was to analyze the remaining sample
using mass spec, CD, and HSQC experiments.
CD & HSQC with TROSY
The sample was adjusted to pH 4.64 and concentrated from 4.5 ml to 0.225 ml (Figure
5b). The 2D-NMR experiment HSQC with TROSY (referenced) contains some dispersed peaks,
which is an indication that regions of the protein are folded (Figure 7). However, the “blob” of
peaks in the center of the spectrum indicates that many nuclei are in similar chemical
environments, which results in peaks with similar frequencies. This is an
indication that portions of the protein are unstructured. It may be that these peaks are partially
due to the smaller molecular weight, contaminating proteins. However, CD on the same sample
(15 ul of sample diluted in 285 ul of 10 mM KH2PO4, 2 mM DTT pH 5.65) showed a similar
result: about half of the protein appeared to be random coil. These data were similar to those acquired
by Lu on the C-termH6 tag construct (Table 4).
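The volumes quoted above correspond to simple fold changes, worked out in the sketch below. Only the stated volumes are used; the variable names are illustrative.

```python
# Concentration step before NMR: 4.5 ml reduced to 0.225 ml.
concentration_factor = 4.5 / 0.225

# CD sample: 15 ul of the concentrated sample diluted into 285 ul of buffer.
cd_sample_ul = 15.0
cd_buffer_ul = 285.0
cd_dilution_factor = (cd_sample_ul + cd_buffer_ul) / cd_sample_ul

print(f"concentrated {concentration_factor:.0f}-fold for NMR")
print(f"diluted {cd_dilution_factor:.0f}-fold for CD")
```

The two factors are equal, so the CD measurement was made at approximately the protein concentration the sample had before it was concentrated for NMR.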
COG 0316S: H. sapiens gene product 2 (HSap2)
Computational biology
An expansion of COG 316 resulted in the growth of the domain family from 9 to 39
protein sequences (Table 5), representing 16 prokaryotes and 7 eukaryotes. The domain is
approximately 104 residues in length, and most of the proteins appear to be single domain
proteins. The domain sequence is characterized by an overall striking sequence conservation,
with several interesting sequence features. For example, the domain contains 3 conserved
cysteines located next to conserved glycines. The C-terminal region also contains a motif
indexed in Prosite but of unknown function. A phylogenetic tree was made from the multiple sequence
alignment, which allowed us to choose subgroups of the domain to work on (Figure 8). Ultimately, the choice was
made to work on the HSap2 domain sequence.
Biochemistry
One note about the 316 HSap2 domain sequence: there is only one residue that
contributes significantly to its absorbance at 280 nm. That residue is a tyrosine, which
results in a very low molar extinction coefficient of ~1760. This makes quantification of protein
via A(280) difficult and unreliable. Comparison of SDS-PAGE results and the A(280) readings
confirms this. Thus, protein was always quantified by analysis via SDS-PAGE.
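The effect of a low extinction coefficient can be illustrated with the Beer-Lambert law, A = epsilon x c x l. The sketch below estimates the absorbance of a 1 mg/ml solution in a 1 cm cuvette. The molecular weight used for the HSap2 domain is a hypothetical round figure (the actual value is not given in this section), and the units of the extinction coefficient are assumed to be M^-1 cm^-1. For comparison, the 229 protein's absorbance at 1 mg/ml is derived from the relationship 1 A(280) = 0.84 mg/ml quoted earlier.

```python
def a280_at_1_mg_per_ml(epsilon_per_M_cm, mw_da, path_cm=1.0):
    """Beer-Lambert absorbance of a 1 mg/ml protein solution.

    1 mg/ml equals 1 g/l, so the molar concentration is 1 / MW.
    """
    molar_conc = 1.0 / mw_da
    return epsilon_per_M_cm * molar_conc * path_cm


ASSUMED_HSAP2_MW_DA = 12000.0  # hypothetical; not stated in the text

hsap2_a280 = a280_at_1_mg_per_ml(1760.0, ASSUMED_HSAP2_MW_DA)
cele_229_a280 = 1.0 / 0.84  # from 1 A(280) = 0.84 mg/ml

print(f"316 HSap2 (assumed MW): A(280) of about {hsap2_a280:.2f} at 1 mg/ml")
print(f"229 protein:            A(280) of about {cele_229_a280:.2f} at 1 mg/ml")
```

Under these assumptions the 316 HSap2 domain absorbs several times less per milligram than the 229 protein, so small baseline or contaminant contributions would dominate the reading.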
Expression Testing and 316 HSap2-14-Nterm-hexahis construct
Eight of the nine HSap2 expression vectors for COG 0316S (Table 3) were tested for
levels of total protein expression at 37°C as described in “testing total protein expression
(small-scale) via boil/chill lysis” of Experimental Methods. Extrapolations were made for how much
total protein would be expressed in 1 liter of minimal media and 1 liter of LB (Figure 9). These
data indicate that, for this domain, the expression vectors with the T7 promoter and those in the pLysS
cells seem to express the best. In addition, the pET 14 construct generally expresses in high
quantity. For these reasons, the pET 14 construct in pLysS cells was chosen for large scale
expression. Because small-scale solubility testing with B-PER indicated that all 316 domain
sequences were in the insoluble phase (inclusion bodies), 500 ml of culture containing
15N-enriched HSap2-14d-Nterm-hexahis was lysed and purified in denaturing conditions (see
“large-scale protein expression and Ni+2-affinity chromatography purification – denaturing conditions”
in Experimental Methods). This resulted in a pure yield of about 10 mg of recombinant protein
isolated from Ni+2-affinity chromatography.
The pH of the sample was raised to ~7.5 and DTT was added to 5 mM. The sample was
stored at 4°C for about 25 days. Attempts at folding were then made via microdialysis (see
Experimental Methods: microdialysis). In each condition tested, some precipitate was found
after dialysis (Table 6). A trend seemed to be that the closer the dialysis buffer pH was to the pI
(5.76) of the protein, the less precipitate formed.
Solubility testing via sonication
During the refolding experiments, Dr. Kristin Gunsalus advised that the B-PER reagent
had given her false positives regarding insolubility of protein for some of her samples. It was
also brought to my attention that reducing the rate of transcription/expression by lowering the
incubation temperature post-induction can increase the proportion of protein in the
soluble phase. High-expressing vector and cell line combinations were therefore tested again for solubility
and expression, now at 27 and 37°C and after lysis via sonication (Table 7). B-PER was also
tested, and it showed no soluble protein after the 37°C post-induction period. The 23d-CtermH6
tag construct showed promising results.
Large scale expression and purification in native conditions for 15N-enriched HSap2-23d-CtermH6 construct
The 316 HSap2-23d-Cterm-hexahis protein was expressed in one liter of 15N-enriched minimal
media. Expression, cell lysis, and Ni+2-affinity purification are described in Experimental
Methods. SDS-PAGE analysis shows the amount of recombinant protein at each of the following steps
to be (Figure 10):
sonication                  > 72 mg
soluble sup                 60 mg
after filtering             60 mg
loss due to flow-thru       -30 mg
recovered from elution      9 mg
It was noticed that a large amount of the target protein remained in the flowthru that
passed through the column after loading the crude cell lysate. Passing this fraction through the
column again did not improve recovery. When the flowthru was passed through columns of
fresh Ni+2-NTA, an additional 7-9 mg of target protein could be recovered. This was done using
3 columns in series with about 2 to 5 ml of Ni+2-NTA beads per column. A trend was also
observed: the greater the volume of fresh beads used, the more protein was recovered.
This indicates that this protein easily saturates the Ni+2-NTA beads. If more Ni+2-NTA columns
were used, more target protein would be recovered. However, as the volume of the beads in a
column increases, the hydrostatic pressure of the solution above the beads and the flow rate both
decrease. After running this sample through 4 columns of Ni+2-NTA beads, the total target
protein recovered increased to ~16 mg.
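The recovery figures above can be put in proportion with a short calculation. This is only a bookkeeping sketch: it assumes that the 60 mg found in the filtered soluble supernatant is the appropriate starting amount, and the function name is hypothetical.

```python
def percent_recovered(recovered_mg, starting_mg):
    """Return recovered protein as a percentage of the starting amount."""
    return 100.0 * recovered_mg / starting_mg

# Amounts estimated by SDS-PAGE, as reported in the text.
SOLUBLE_MG = 60.0             # filtered soluble supernatant loaded on the column
FIRST_PASS_MG = 9.0           # recovered from the first elution
AFTER_FOUR_COLUMNS_MG = 16.0  # approximate total after 4 Ni+2-NTA columns

first_pass_pct = percent_recovered(FIRST_PASS_MG, SOLUBLE_MG)
total_pct = percent_recovered(AFTER_FOUR_COLUMNS_MG, SOLUBLE_MG)

print(f"first pass: {first_pass_pct:.0f}% of soluble protein")
print(f"after 4 columns: {total_pct:.0f}% of soluble protein")
```

On this basis, most of the soluble target protein is still unrecovered even after four columns, which is consistent with the saturation behavior described above.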
Gel filtration studies
Gel filtration showed that the majority of the protein eluted in the void volume of the
column. The large peak at this volume corresponds to globular structures greater than 70 kDa
(Figure 11). A small peak was also observed at a time corresponding approximately to the
dimeric weight of the protein. When the fractions corresponding to this peak were concentrated
and analyzed via SDS-PAGE, the monomer appeared faintly on the gel. In addition, a band
corresponding to a dimer also appeared on the gel. This band actually appears in every lane in
every gel on which the 316 protein was run. The protein may be in equilibrium as several
different molecular species. Perhaps the protein is present in a complex with molecular weight
greater than 70 kDa, with this association resistant to DTT. Subjecting the protein to the
denaturing and reducing conditions of SDS-PAGE with B-ME may dissociate the protein
predominantly into a monomer. However, there may be a covalent association between a small
portion of the monomer units to form a homodimer. This may explain the small peak in the gel
filtration profile and the supposed dimeric band in the SDS-PAGE analysis. The protein may be
in equilibrium as a 70+ kDa species, a dimer, and a monomer. On the other hand, this may just
be evidence of aggregation and partial dissociation of the aggregate under certain conditions.
Ion exchange chromatography
Because the gel filtration experiment was not successful in purification and because the
Ni+2-affinity chromatography leaves contaminating proteins in the sample, anion exchange
chromatography was implemented.
Buffer-exchanging the 47.5 ml of sample was first necessary to bring the protein to
suitable conditions (i.e. low salt) for a Mono Q column. The 50 mM NaPi, 300 mM NaCl, 250
mM imidazole pH 8.0 buffer was exchanged via dialysis for a 20 mM Tris, 5 mM DTT pH 7.5
buffer. No precipitate was evident. The entire sample was then loaded onto the Mono Q
column. Flowthru was collected, and it was discovered via SDS-PAGE that a small amount of
the protein did not bind the column; the column was probably saturated. The protein that was
bound to the Mono Q column was eluted using a 1 M NaCl linear gradient (from 0 to 100%). A
large, somewhat broad peak was observed at 58% into the gradient (Figure 12). Fractions 9 to
20 (~12 ml total) of the eluate were found to contain the protein via SDS-PAGE analysis. These
fractions were pooled together and analyzed via SDS-PAGE (Figure 13b). Because of the
appearance of the peak, it was incorrectly assumed that most of the protein had eluted from the
column. However, when the amount of target protein in the ion-exchange purified sample was estimated via SDS-PAGE, only ~2 mg of protein was seen. This represents an ~80% loss.
Fractions 24 to 33 from the ion exchange profile were concentrated from 10 ml to 0.325 ml and
were analyzed via SDS-PAGE. It was estimated that these fractions collectively contain only 20
ug of the protein. If most of the protein is not present in the elution fractions or in the flowthru,
then it must still be bound to the column. It may be necessary to work with buffers that are more
prone to compete with protein for the column’s charged (quaternary ammonium) groups. Another possibility is that the protein
may have irreversibly bound to the column’s resin. Alternatively, the protein
may have degraded over time. Mass spec of the ion exchange purified sample shows the largest
peak to be at m/z = 3720 and no peak to be present at the molecular weight of the domain
sequence. However, SDS-PAGE of this sample shows the protein to be largely present as the
monomer (Figure 13). In addition, comparing the Ni+2-affinity chromatography purified sample
with the Ni+2-affinity chromatography and anion exchange purified sample shows the latter to be
slightly purer (Figure 13).
Conclusion & future work
We have shown that nickel-affinity chromatography and ion exchange chromatography
(with considerable loss) remove contaminating proteins from our 316 sample. Gel filtration
shows the domain to form a high molecular weight species, with multiple bands showing on
reducing and denaturing gels. This phenomenon has been observed in the past for
homomultimers that are covalently bound and experience an equilibrium between different
molecular weight species. CD will be performed on the purified sample of this COG.
Sedimentation equilibrium may allow us to ascertain the molecular weight of species in solution.
A 30 kDa cut-off filter will be used on the sample to see if the lower molecular weight species
can be separated. In addition, the protein sample may be produced on a larger scale than before,
subjected to the high-yield recovery and purification steps proven effective, and combinatorially
tested for crystallization suitability.
Elsewhere, some limited insight has been gained with respect to COG 0316S. RNAi
experiments (Fabio Piano -- Cornell Univ.) done using the C. elegans cDNA as template result in
an early larval lethal phenotype with complete penetrance. This result was shown in three repeated
trials. In addition, some of the genes coding for ORFs for this domain come from operons that
are active during nitrogen fixation in some bacteria. Moreover, many neighboring genes in these
operons are implicated in iron-sulfur cluster formation or transportation. The 316 domain may be
a protein that is associated with iron-sulfur clusters through its conserved cysteines, and it may
act as a metabolic pathway member. These conserved cysteines also give reason to analyze the
protein in oxidizing conditions.
We have shown mono-S column chromatography to be ineffective for purifying
domain 229. Gel filtration will be used to purify this domain sequence. Exopeptidases will be
applied to see if limited proteolysis will cleave off the region of the domain sequence possibly
responsible for disorder. CD and NMR will follow.
References:
Altschul, Stephen F., et al. NAR. 25:3389-3402. 1997.
Bairoch A., Apweiler R. NAR. 28:45-48. 2000.
Chothia, C. Nature. 357:543-544. 1992.
Corpet, F., Gouzy, J., Kahn, D. NAR. 26: 323-326. 1998.
Corpet, F., Gouzy, J., Kahn, D., NAR. 27:263-267.
Feng, W., et al. Biochemistry. 31:10881-96. 1998.
Gunsalus, K.C., et al., submitted. 2000.
Hofmann K., Bucher P., Falquet L., Bairoch A. NAR. 27:215-219. 1999.
Holm, L. & Sander, C. NAR. 25:231-234. 1997.
Holm, L. & Sander, C. Science. 273:595-602. 1996.
Huang, X. Genomics. 33:21-31. 1996.
Hwang, K.Y., et al. Nature Structural Biology. 7:691-696. 1999.
Levitt, M. & Gerstein, M. PNAS USA. 95:5913-5920. 1998.
Lima, C.D., Klein, M.G., & Hendrickson, W.A. Science. 278:286-290. 1997.
Montelione, G.T. & Anderson, S. Nature Structural Biology. 6:11-12. 1999.
Montelione, G.T., et al. NMR Pulse Sequences and Computational Approaches for Automated
Analysis of Sequence-Specific Backbone Resonance Assignments of Proteins. Biological
Magnetic Resonance, Volume 17: Structure Computation and Dynamics in Protein NMR. Eds.
Krishna & Berliner. Kluwer Academic / Plenum Publishers: New York. 1999.
Patthy, L. Matrix Biology. 5:301-310; discussion 311-312, 1996.
Rost, B. Meth. in Enzym. 266: 525-539. 1996.
Rost, B. & Sander, C. J. Mol. Biol. 232: 584-599. 1994.
Rost, B. & Sander, C. Proteins. 19: 55-77. 1993.
Schuler, et al. Science 274, 540-546. 1996.
SPARKY: http://www.cgl.ucsf.edu/home/sparky/
Tatusov, R.L., Koonin, E.V., & Lipman, D.J. Science. 278:631-637. 1997.
Terwilliger TC & Berendzen J. Genetica. 106(1-2):141-7, 1999.
Thompson, J.D., Higgins, D.G., Gibson, T.J. NAR. 22:4673-80. 1994.
Tilghman, S.M. Genome Research. 6:773-780.
Legends
Figures
Figure 1. Computational biology protocol.
Steps up to "Search EST databases" were applied to COG 0011S to expand the domain family.
This entire protocol was applied to COG 0316S to expand the domain family and select
sequences to express.
Figure 2. Molecular biology protocol.
This protocol was applied to COG 0316S. 5 coding sequences were ligated into the "holding"
vector, but only 3 of these coding sequences were of suitable quality for ligation into expression
vectors (see tables 2 and 3).
Italicized steps are those that may be omitted in a high-throughput operation.
Figure 3. Digest and cloning scheme for formation of 316 CEle1 (a) and HSap2 (b) expression
vectors.
A single PCR product may be produced for ligation into 9 different expression vectors of the
Gunsalus expression kit (Gunsalus, et al., 2000). To preserve the C-terminal protein sequence
for the N-terminal hexa-his tag and non-fusion constructs, 2 different PCR products were
synthesized (indicated by the 2 types of reverse primers for CEle1).
Changes were made to the 5' and 3' regions of the coding sequence for each domain to
accommodate ligation into the expression vectors. In some cases, this changed the native protein
sequence at its N-terminal and/or C-terminal regions. This is indicated in the peptide sequences
section of (a) and (b).
Figure 4. Sequence for 229 CEle Nterm hexa-histidine tag construct.
Computational biology, molecular biology, and protein expression and solubility testing for
domain family 0229S were done previously by Alexandra Gardino & Charles Lu. The sequence
is 162 residues long and has a molecular weight of 18,449 g/mol, a molar extinction coefficient of
approximately 21870, a pI of 7.70, and a charge of 2.85 at pH 7. Note the 8 cysteines.
Computational biology work for this COG may be found at the CABM NMR Lab COG Analysis
& Expansion Website.
Figure 5. Purification of 229 CEle Nterm hexa-histidine tag construct. (a) shows protein
purified using Ni+2-affinity chromatography under native conditions. Yield is approximately 8
mg of target protein.
1. total cell lysate sonicated
2. soluble protein
3. column flowthru
4. washes
(b) shows protein after Ni+2-affinity chromatography and dialysis to pH 5.45 buffer. During this
pH reduction from pH 8.0, some of the protein precipitated, reducing yield to approximately 3
mg of target protein. It also appears that some of the higher molecular weight protein has
reduced in quantity while lower molecular weight protein has increased in quantity (compare a
and b). This may indicate degradation of higher molecular weight contaminants.
Figure 6. Mass spectrometry of 229 CEle Nterm hexa-histidine tag protein that had undergone
nickel affinity purification plus dialysis across pI (figure 5b).
The major peak corresponds to the protein’s molecular weight. A smaller, higher molecular
weight peak indicates that a portion of the protein forms a dimer during mass spectrometry
conditions.
Some lower molecular weight contaminants remain in the sample after Ni+2-affinity
chromatography and dialysis across the pI of the protein. These data are consistent with figure
5b.
Figure 7. 229S CEle Nterm hexa-histidine tag protein HSQC with TROSY (pH 4.64).
Figure 8. Unrooted phylogenetic tree of domain family 316.
The tree graphically represents the similarity between sequences. Lengths of branches
correspond inversely to degree of similarity between sequences on the branches. Branching is an
indication of a supposed evolutionary split between similar sequences. Those sequences that we
chose to clone are circled. Note how they are in different groups.
Figure 9. COG 0316S HSap2 small scale expression testing.
LB and MJ9 (minimal) refer to media types. DE3 and pLysS refer to expression cell lines of the
BL21 series available from Novagen. Testing was conducted at 37°C. Whole-cell lysate was
analyzed via SDS-PAGE. Vector construct notation is explained in figure 3.
Figure 10. 316 HSap2-Cterm-hexahis tag protein purified via Ni+2-affinity chromatography.
Gel 1 shows the status of the protein during this purification. Gel labeling refers to:
1. Total cell lysate (via sonication) after expression
2. Supernatant (soluble proteins) of total cell lysate
4. #2 filtered using sterile 0.22 um filter
5. Flow-thru from nickel column affinity purification.
WASH … Washing of nickel column (bound with column) with increasing
concentrations of imidazole solutions (300 mM NaCl, 50 mM NaPi, pH 8.0; imidazole
increased from 10 to 20 to 250 mM; 250 mM is the elution concentration).
5. 4 ug lysozyme (for standard)
Gel 2 shows the elution fractions from this purification.
Gel 3 shows a nickel-affinity column (fresh nickel) purification of the flow-thru (gel 1,
lane 5) of the previous column run.
Gel 4 shows that a significant amount of the protein is still present in the flow-thru. This
flow-thru was run 4 more times through nickel columns to recover a total of about 12
mg protein.
Figure 11. Gel filtration of 316 HSap2-Cterm-hexahis tag protein. (a) shows the chromatogram of
the experiment. SDS-PAGE analysis showed most of the protein to be present in the void
volume (40 ml elution point). This corresponds to Mr > 70 kDa. A very small amount of the
protein elutes at the 62 ml point. This roughly corresponds to the protein’s dimeric
molecular weight.
(b) represents a calibration curve for the column.
Figure 12. Anion exchange chromatography purification of 316 HSap2-Cterm-hexahis tag
protein.
(a) refers to a chromatogram from purification on a Pharmacia Mono Q column. A linear
gradient of 1M NaCl buffer was used for elution. Elution buffer (pH 7.5) also contained 20
mM Tris, 5 mM DTT. Protein predominantly elutes from column 58% into the 1M NaCl
gradient.
(b) refers to SDS-PAGE analysis of elution fractions corresponding to figure 12a. Figure 13b
shows yield after this purification step.
Figure 13. COG 0316 HSap2 Cterm hexahis after Ni+2-affinity chromatography and anion
exchange chromatography purification.
(a) refers to protein just after Ni+2-affinity chromatography.
(b) refers to protein concentrated after Ni+2-affinity chromatography and anion exchange
chromatography purification. Yield was reduced by ~80% during the anion exchange
chromatography experiment. However, higher molecular weight contaminants were
removed during this step.
Tables
Table 1. DNA acquisition for 316 domain.
Genomic DNA was obtained from neighboring labs. cDNA was ordered from Genome Systems
of the IMAGE Consortium.
The yeast gene can be obtained from a cosmid in which it has been cloned. However, we were
able to conveniently use genomic DNA from a neighboring lab. The shading indicates that we
did not use the cosmid.
Table 2. Partial listing of PCR errors generated during cloning of coding sequences for domain
316.
Sequence errors for all PCR products were found in the region coded by primers. No pattern of
errors is evident from analysis of primer sequence and PCR product sequence. In addition, the
same primer batch gave different patterns of errors for separate PCR reactions. The errors
limited the number of expression vectors that were constructed (table 3).
Table 3. Expression vector status for 316 domain.
In most cases, 2 types of PCR products were made. One product (C type) was made in which the
DNA encoding the C-term portion of the domain was modified to accommodate a hexa-histidine
tag. Another product (N type) was made in which native sequence at this C-term encoding
region was left intact.
Table 4. Circular dichroism data for 229 CEle-Nterm-hexahis protein and CEle-Cterm-hexahis
protein (Lu data).
These data consistently show that the 229 hexahis fusion protein is approximately 50% random coil,
15% alpha helix, 32% beta strand, and 4% turn.
Table 5. Members of the expanded 0316 domain family.
The 316 domain family was expanded to include 39 sequences – 16 of the sources being
prokaryotic and 7 being eukaryotic.
Red indicates original COG member. Black/green indicates those added to the expanded COG
0316S family. Green indicates sequence currently being investigated by us. ID refers to label
given by us to each protein sequence. Swiss-Prot ID's begin with "P", "Q", or "Z". (P) indicates
prokaryote and (E) indicates eukaryote. EST appearing under protein length signifies that the
entire sequence of the protein is not known.
Table 6. Folding of 316 HSap2-Nterm-hexahis: microdialysis from denaturing to native
conditions.
Original protein buffer was 6 M GuHCl pH ~7.9 and all solutions contained 5 mM DTT. Protein
was heated at 52°C for 10 minutes before dialyzing. Concentration of protein was ~250 µg/ml.
Observations for precipitate were made under light microscope after 10 hrs. of dialysis at 4°C.
Samples were again checked 12 hrs. after initial observations; no changes occurred. Samples
were then allowed to sit at room temperature for 2 days; no changes occurred.
Table 7. Small-scale solubility testing of expression of COG 0316 Hsap2 variants using
sonication for cell lysis.
Testing was performed in minimal media. Whole cell lysate (total protein) and supernatant
(cytosolic protein) were analyzed via SDS-PAGE. Target protein levels were projected for one
liter of bacterial culture.
Each line contains mg of soluble protein divided by mg of total protein for a temperature/cell line
condition. "Inc" stands for inconclusive, meaning solubility of protein could not be judged due
to unsuccessful sonication.
Cells went through post-induction period of 5 hours at 37C or hours at 27C. These promoter /
tag location / cell line conditions were also tested for 27C / 8 hr. post-induction period using the
detergent B-PER to recover protein. These tests for all 5 different conditions indicated that
protein was insoluble.
Promoter / tag location / cell line combination chosen for large-scale expression and
purification under native conditions is boldfaced.
Choose COG (small, unknown function).
Create an msa of sequences.
Decide upon ends of domain (regions of conservation and start/stop codons).
Write a consensus sequence from alignment (degeneracy allowed).
Check PDB for possible homologues (BLAST).
Check for transmembrane features (PredictProtein).
Use HMMER to search nr db for possible homologues.
Determine if sequences should be added to expanded COG (E scores, region of alignment, phylogenetic
tree, msa).
Check PRODOM & Prosite for any possible similar domains and motifs.
Check Swiss-Prot for any functional information about sequences.
Search EST databases for possible homologues (Unigene). If similar EST's are found, create a consensus
among the overlapping EST's (Unigene, CAP, 6-frame translation). Add to msa/tree and decide if new
sequences are possible homologues.
Finalize domains ends (previous, 2° structure prediction).
Select sequences to express (tree, availability of DNA/cDNA, intron interrupting ORF, common
restriction sites in DNA, ease in primer design).
Figure 1
Determine sequence of DNA that will be "PCR-ed out" (sequencing, if necessary)
Strategically design primers (restriction sites)
PCR out coding sequence from genomic or complementary DNA
Run insert on and purify from gel
Adjust insert to appropriate concentration
Ligate into a holding/amplification plasmid
Transform into amplification / holding cells, plate, and incubate
Pick multiple colonies according to blue (negative) / white (positive) selection
Verify ligation of insert using restriction enzyme digests and/or colony PCR
Sequence insert and verify for accuracy according to database and primer sequence
Amplify insert in holding vector in large bacterial culture, and purify plasmid
Digest insert from vector strategically for ligation into expression vectors
Run insert on and purify from gel. Concentrate insert.
Prepare expression vectors (amplify digest strategically, purify, and concentrate)
Ligate insert into expression vectors
Check to see if insert is present
Figure 2
Figure 3, panels (a) and (b)
MGHHHHHHSHMTTKKFRMEDVGLSKLKVEKNPKDVKQTE
WKSVLPNEVYRVARESGTETPHTGGFNDHFEKGRYVCLCCG
SELFNSDAKFWAGCGWPAFSESVGQDANIVRIVDRSHGMHR
TEVRCKTCDAHLGHVFNDGPKETTGERYCINSVCMAFEKKD
Figure 4
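The properties quoted in the Figure 4 legend can be checked directly against the sequence. The sketch below is a minimal illustration using only the Python standard library; the variable names are hypothetical, and the sequence is copied from the four lines above.

```python
from collections import Counter

# Sequence of the 229 CEle Nterm hexa-histidine construct, copied from Figure 4.
SEQUENCE = (
    "MGHHHHHHSHMTTKKFRMEDVGLSKLKVEKNPKDVKQTE"
    "WKSVLPNEVYRVARESGTETPHTGGFNDHFEKGRYVCLCCG"
    "SELFNSDAKFWAGCGWPAFSESVGQDANIVRIVDRSHGMHR"
    "TEVRCKTCDAHLGHVFNDGPKETTGERYCINSVCMAFEKKD"
)

# Count how often each one-letter residue code occurs.
composition = Counter(SEQUENCE)

print("length:", len(SEQUENCE))
print("cysteines:", composition["C"])
print("tryptophans:", composition["W"])
print("tyrosines:", composition["Y"])
```

The length and cysteine count should agree with the 162 residues and 8 cysteines stated in the legend. The tryptophan and tyrosine counts are the residues that dominate absorbance at 280 nm.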
Figure 5 (gel lanes 1 to 4; elution fractions)
Figure 6 is a mass spectrometry result file of COG 229 Cele1-NtermH6 tag-14 protein. It is currently not on
the web. It is available from the author upon request.
Figure 8 is a phylogenetic tree of the COG 316 domain family. It is viewable on the web at the
CABM NMR Lab COG site, which is reachable from www-nmr.cabm.rutgers.edu
Figure 10 (Gels 1 to 4; lane labels include 1 to 6, WASH, ELUTION FRACTIONS, FLOW THRU FROM NICKEL AFFINITY COLUMN, and 2 ug and 4 ug lysozyme standards)
Figure 11, panels (a) and (b)
Figure 12, panels (a) and (b) (fraction labels: 1, 9, 11, 17, 18, 19, 22)
Figure 13
(a) Purified by Ni+2-affinity chromatography.
(b) Purified by Ni+2-affinity and anion exchange chromatography (concentrated).
My
Code for
Domain
Bsub1
Species
Strain
Accession Number for
Protein/DNA target
Source / what
was given
Bacillus
subtilis
168
Protein = 2635713 (ncbi)
Genome = Z99120 (ncbi)
S. Anderson via
Rehan Azia /
genomic DNA
Hinf1
Haemophilus
influenzae
RD/KW20
Scer2
Saccharomyces
cerevisiae
S288C/AB97
2
Protein = P45344 (SP);
1175501 (ncbi)
Genome = 1574575 (ncbi)
Protein = Z12425 (SP);
2495215 (ncbi)
Genome = Z71255
Chromosome cosmid =
805025 (ncbi); Z49219 (ncbi)
C41876
C31399
EST = AA727377 (ncbi)
S. Anderson via
Rehan Aziz /
genomic DNA
Steve Brill /
genomic DNA
Cele1
Mmus1
Caenorhabditis
elegans
Mus musculus
CB1489 him8(e1489)
C57BL/6
Hsap2
Homo sapiens
From brain;
anaplastic
oligodendend
rogliome
tissue
Table 1
EST = AI202743 (ncbi)
XX
Yuji Kohara
(Japan) / EST
WashU-HHMI
Mouse EST
Project; contact
Marra M / Mouse
Project via
IMAGe (EST)
Contact: Robert
Strausberg, Ph.D.
(NCI-CGAD) via
IMAGE (EST)
CDNA#
(necessary
for ordering)
Reported
Length (if
applicable)
Other notes
ORF we need is
complement of
107656..108018;
We have strain 1A243
ORF we need is
2450..2794 of U32845
ORF we need is
151843..152400
Cosmid
9299
Yk282c1
IMAGE:
1209939
IMAGE:
1859411
ORF we need is
30083..30640
Insert  360
bp
447 bp
450 bp
Part of Unigene
Mm.7884
Part of Unigene
Hs.63913
PCR
Product
BSub1-1
BSub1-2
CEle1-1
CEle1-2
HInf1-1
Table 2
Expected
sequence
CGA AGC
TTA TTA
GGT
T
GC
Observed
sequence
CGA ATC GTA
GTA CTT
G
CG
HInf1-1
GGGATCCTG TCTGATCCGG
G
Ax4
Cx4
HInf1-1
TCC
TAA
HSap2
GCC
GCT
SCer2-1
T
G
Comments
Reverse primer. Errors cause loss of
restriction site and a stop codon. Error
causes lys  pro.
Forward primer
Error is at 5' end of forward primer.
Upstream of first restriction site; will not
cause a problem
No errors
Forward primer. Several errors cause loss
of BamHI site.
Reverse primer. Substitution error
occurred at least 4 times.
Reverse primer. Caused loss of stop
codon.
Forward primer. Codons are synonymous
for ala and are not rare. Will not cause
problems.
Reverse primer. Loss of HindIII site.
My
Code for
Domain
Species
PCR
product(s)
made?
Sequence error(s) in PCR
product?
Nature of error(s)
# of
expression
vectors cloned
Types of expression vector cloned (by
affinity-tag)
N-term H6 C-term H6 Nonfusion
3/9
0
0/9
Bsub1
Bacillus
subtilis
Yes
Hinf1
Haemophilus
influenzae
Saccharomyces
cerevisiae
Caenorhabditis
elegans
Yes
C type is fine.
N type has error in primer
annealing region.
Yes
Yes
Yes
Yes
Mus musculus
Homo sapiens
No
Yes
C type has error in primer
annealing region.
N type is fine.
not applicable
None
Scer2
Cele1
Mmus1
Hsap2
Table 3
0
0
T7
T7 lac
T5 lac2
0
0/9
0
0
0
5/9
T7
T7 lac
T5 lac2
0
T7
T7 lac
0
T7
0
T7
T7 lac
T5 lac2
T5 lac2
0
T7
T7 lac
T5 lac2
0/9
8/9
0
Table 4
Per cent secondary structure
structure type    25 degrees Celsius    10 degrees Celsius    Lu data
alpha             15.8                  11.7                  14
beta              29.2                  32.6                  39
random            48.5                  52.1                  44
turn              6.4                   3.4                   3
ID
Swiss Prot ID or
GenBank ID
ECol1
Ecol2
Ecol3
HInf1
Hinf2
SSp1
SSp2
SCer1
SCer2
AVar1
Avar2
ASp1
P36539
P77667
P37026
P45344
P44672
P72731
P74596
Q07821
Q12425/Z12425
P46051
P46052
P18501
PBor1
MTub1
SPom1
SPom2
RSp.1
PPur1
BJap1
ABra1
AVin1
Avin2
P46053
Q10393
P78859
2950483
Q53211
P51217
P37029
Q43895
Q44540
2271523
Plectonema boryanum (P)
Mycobacterium tuberculosis (P)
Schizosaccharomyces pombe [strain
972h- for Spom2] (E)
Rhizobium Sp. (P) NGR234
Porphyra purpurea (P)
Bradyrhizobium japonicum (P)
Azospirillum brasilense (P)
Azobacter vinelandii (P)
121
118
190
205
106
114
118
125
107
107
FAln1
RCap1
RSph1
BAph1
Q47887
Q07184
Q01195
2738590
Frankia alni (P)
Rhodobacter capsulatus B10 (P)
Rhodobacter sphaeroides (P)
Buchnera aphidicola (P)
135
106
106
113
AAeo1
BSub1
CPCC1
Ovol1
Cele1
2984147
2635713
2183309
AA68340
C41876
116
120
119
EST
EST
Rat1
Rat2
Mmus1
Mmus2
Hsap1
Hsap2
Unigene Rn.3442
AI059493
Unigene Mm.7884
3447426
Unigene Hs.10473
Unigene Hs.63913
Aquifex aeolius (P)
Bacillus subtilis 168 (P)
Cyanothece PCC8801 (P)
Onchocerca volvulus (E)
Caenorhabditis elegans (E) CB1489
him-8(e1489)
Rattus norvegicus (E)
Unknown
Unknown
May be required for Nfixation
May be required for Nfixation
Unknown
Unknown
Unknown
Unknown
Unknown
Unknown
Unknown
Unknown
Unknown
Role in Fe-S cluster
assembly (?) [UI
98250785]
Unknown
Unknown
Unknown
May be involved in
septum formation during
cell division [UI
98087557]
Unknown
Unknown
FifENXW operon (?)
Unknown
Unknown
EST
EST
EST
EST
EST
EST
Unknown
Unknown
Unknown
Unknown
Unknown
Unknown
Table 5
Organism
Protein
Length
(a.a.)
Escherichia coli (P)
107
122
114
Haemophilus influenza RD/KW20 [strain
114
for Hinf1] (P)
107
Synechocystis Sp. (P)
118
113
Saccharomyces cerevisiae (E)
250
185
Anabaena variabilis (P)
123
123
Anabaena Sp. (P)
113
Mus musculus (E) C57BL/6 [strain for
Mmus1]
Homo sapiens (E)
Functional
Implications
Unknown
Unknown
Unknown
Unknown
Unknown
Unknown
Solution
pH Precipitate
Med-high
Med-high
Precipitate when in
Buffer + 0.1 M NaCl
High
Med-high
7.0
7.0
Med-high
Med
Med
Med
7.5
Light
Light-med
10 mM NaOAc 5.0
10 mM NaPi 6.0
10 mM NaPi
10 mM Bis
Tris
10 mM Tris
Table 6
Table 7
promoter    H6 tag location: N-term    H6 tag location: C-term
T7          0/5.3 : 37C, DE3           Inc/12: 37C, DE3
            Inc/20 : 37C, pLysS        Inc/16: 37C, pLysS
            1/24 : 27C, DE3            16/20: 27C, DE3
            2/32: 27C, pLysS           16/>24: 27C, pLysS
T7 lac      Not tested                 Inc/12: 37C, pLysS
                                       0/24: 27C, pLysS
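Each Table 7 entry gives mg of soluble protein divided by mg of total protein for one temperature and cell line. The sketch below shows one way such entries could be turned into soluble fractions. The function name and the returned field names are hypothetical, "Inc" is treated as missing data, and a total written as ">24" is treated as a lower bound, which makes the computed fraction an upper bound.

```python
def parse_entry(entry):
    """Parse a Table 7 style entry such as '16/20: 27C, DE3'.

    Returns the soluble fraction, or None when the soluble amount is 'Inc'.
    A total written as '>24' is a lower bound, so the fraction is an upper bound.
    """
    ratio, condition = entry.split(":", 1)
    soluble_txt, total_txt = (part.strip() for part in ratio.split("/"))
    temperature, cell_line = (part.strip() for part in condition.split(","))

    is_upper_bound = total_txt.startswith(">")
    if soluble_txt == "Inc":
        fraction = None  # sonication was inconclusive; no fraction can be given
    else:
        fraction = float(soluble_txt) / float(total_txt.lstrip(">"))

    return {
        "temperature": temperature,
        "cell_line": cell_line,
        "soluble_fraction": fraction,
        "is_upper_bound": is_upper_bound,
    }

# A few entries copied from Table 7.
entries = ["0/5.3 : 37C, DE3", "Inc/12: 37C, DE3", "16/20: 27C, DE3", "16/>24: 27C, pLysS"]
for text in entries:
    print(parse_entry(text))
```

Read this way, the conditions with 16 mg of soluble protein stand out clearly from the others, which show little or no soluble protein.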
Chapter 2: A simulation of structural similarity detection as a method for prediction of
orphan gene product biochemical and cellular function
Abbreviations
PDB, Protein Data Bank; DALI, Distance Matrix Alignment; SCOP, Structural Classification of
Proteins; NCBI, National Center for Biotechnology Information; CDS, coding sequence; PIR,
Protein Identification Resource Database; NR, nonredundant.
Abstract
Many of the proteins corresponding to genes currently identified from large-scale nucleic
acid sequencing projects show no significant sequence similarity to
proteins that have been functionally well characterized, so homology cannot be detected; hence, they fall under the category
“orphan genes.” Scientists have developed alternative methods for predicting the biochemical
and/or cellular functions of these proteins. Results from these predictions can greatly assist in
future functional studies of the proteins. We have used the available collection of protein
structural data and popular sequence and structure comparison programs to conduct a simulated
experiment showing that structural comparison with functionally characterized protein
structures can be used as a tool for predicting the biochemical and/or cellular functions of protein
domains of unknown biochemical and/or cellular function when sequence similarity fails to
present any statistically significant homologues. Ten randomly selected domains of
different folds were tested. For eight of the domains, the major candidate functions suggested by
structural hits with poor sequence similarity provided clues that led back to the
function(s) of those domains.
Introduction
The goals set forth by the Human Genome Project have led scientists to develop new
ways and refine existing ways of detecting the functional implications of genes and the proteins
for which they code. Many “genomicists” are making direct use of the sequence data obtained
from the Project to better their understanding of the roles of the proteins coded by the recently
identified genes. In these cases the sequence information alone is enough to reveal
sequence homology to functionally characterized proteins, enabling scientists to infer the
cellular and/or biochemical function of the protein under study. A strong sequence similarity or
identity with another protein is taken as a sign that the two proteins are homologous. When
two proteins are homologous, it is assumed that they have evolved from a common
ancestor and share function. However, many proteins only show a possible homology with
proteins of known function in the “twilight zone” of homology, a range in which homology
cannot be reliably determined. In addition, a large proportion of the sequenced genes show no
similarity or identity, and thus no evident homology, with any proteins of known function. Both
of these groups make up the so-called “orphan genes” (Holm & Sander, 1996), which account
for between 1/3 and 1/2 of newly sequenced genes (Botstein et al., 1997; Casari et al., 1996). The
question that arises is which alternative approach to use in the functional prediction of these genes.
One possible approach is structural similarity detection, a method that has not been
thoroughly investigated in the literature. This approach involves solving the structure of the
protein under study, searching the protein structure database (e.g. PDB) using an algorithm (e.g.
DALI) designed to detect significant structural similarity, and investigating the biochemical and
cellular functions of that protein’s homologues. Progress in NMR and X-ray
crystallography is adding structures at a rate of well over a thousand per year to the
thousands already housed in the PDB (found in statistics page of http://pdb.pdb.bnl.gov). This
increase in data, together with recent strides in improving the structural and functional
databases that manage this data, is increasing the power of this approach (Holm & Sander, 1996).
Recent work has shown that structural similarity detection is much more powerful
than sequence similarity detection for identifying homologous proteins or protein substructures
during examinations of databases and protein superfamilies (Levitt & Gerstein, 1998; Holm &
Sander, 1997; Lima et al., 1997). This is because the millions of physically possible
amino acid combinations that can occur over the length of a typical domain filter down to a
relatively small number of natural protein folds, estimated to be in the range of 1000-1500
(Chothia, 1992). This provides the impetus for the "structure-based" approach to functional
genomics taken in the following experiment.
We simulated a situation in which the structures of 10 domains had just been solved (in
reality they were already solved and functionally characterized) and no homologues were known
even by lenient measures of sequence similarity. We compared the structures of the 10 domains to
those in the PDB and inspected the reported functions of the nonredundant, significant
structural hits of insignificant sequence similarity. The major candidate functions
suggested by these hits offered valuable insight into the biochemical/cellular functions of
eight of the 10 target domains of "unknown" function.
Experimental Procedures
Target Selection
Ten protein structural domains were chosen from version 1.37 of the SCOP database
(Murzin et al., 1995; Brenner et al., 1996; Hubbard et al., 1997; found at
http://pdb.pdb.bnl.gov/scop/). Domains were randomly chosen from the following four classes:
(a) all alpha
(b) all beta
(c) alpha and beta (a/b) mainly parallel beta sheets
(d) alpha and beta (a+b) mainly antiparallel beta sheets
so that 3 domains each were obtained from classes (a) and (c) and 2 domains each were obtained
from classes (b) and (d).
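The selection scheme above can be sketched as a stratified random draw. This is only an illustration under stated assumptions: the function name, the toy domain identifiers, and the pool size are hypothetical, and the text does not say how the random choice was actually made.

```python
import random

# Quotas per SCOP class as stated in the text: 3, 2, 3, and 2 domains.
QUOTAS = {"a": 3, "b": 2, "c": 3, "d": 2}


def pick_targets(domains_by_class, quotas=QUOTAS, seed=None):
    """Draw the requested number of domains, without replacement, from each class."""
    rng = random.Random(seed)
    return {
        scop_class: rng.sample(domains_by_class[scop_class], count)
        for scop_class, count in quotas.items()
    }


# Hypothetical pool: 20 placeholder identifiers per class (not real SCOP entries).
toy_pool = {cls: [f"{cls}-domain-{i}" for i in range(1, 21)] for cls in QUOTAS}
targets = pick_targets(toy_pool, seed=0)
total_targets = sum(len(chosen) for chosen in targets.values())
```

Fixing the seed makes the draw reproducible, which is convenient when the same target list has to be regenerated later.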
Structure Search & Comparison
Each of the 10 domains obtained from the SCOP database corresponds to a structure
housed in the PDB (Abola et al., 1987; Bernstein et al., 1977). The PDB entry for only one
of the targets (1gnd) contains more residues than the domain, because the domain is
non-contiguous. Files containing the three-dimensional coordinates for each of the domains were
obtained from the February 9, 1998 version of the PDB. The coordinates of each of the 10 target
PDB entries were submitted interactively to the DALI server (Holm & Sander, 1993; found at
http://www2.ebi.ac.uk/dali/), and a file reporting PDB-housed protein chains structurally similar
to the submitted query was returned. DALI uses an algorithm based on distance matrix
representations to compare the Cα values of a peptide chain against the Cα values of other protein
chains. The output file contains structural neighbors ranked by z score, which is a
length-dependent measure of the statistical significance of a protein chain’s similarity score to
the submitted protein chain. This provides a universal quantitative measure of the strength of
the comparison against a specific population. All of the structural neighbors provided by DALI
have z scores greater than or equal to 2.0, meaning that they all show statistically significant
structural similarity to the target protein chain (Holm and Sander, 1994).
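To illustrate what a z score cutoff does, the sketch below standardizes raw similarity scores against a background population and keeps the neighbors with z greater than or equal to 2.0. It is a generic z score, not DALI's length-dependent normalization, and the chain names and scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw_scores, background):
    """Standardize raw scores against the mean and spread of a background population."""
    mu = mean(background)
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have no spread")
    return {name: (score - mu) / sigma for name, score in raw_scores.items()}


def significant_neighbors(z_by_chain, cutoff=2.0):
    """Keep chains at or above the cutoff, ranked from highest to lowest z."""
    kept = [(name, z) for name, z in z_by_chain.items() if z >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy numbers chosen so the background has mean 1 and standard deviation 1.
toy_background = [0.0, 2.0]
toy_raw = {"chainX": 4.0, "chainY": 2.0, "chainW": 3.0}
toy_neighbors = significant_neighbors(z_scores(toy_raw, toy_background))
```

With these toy values chainY falls below the cutoff and is dropped, while chainX and chainW are retained in rank order.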
Sequence Similarity Searching
To judge how the structural neighbors were related in amino acid sequence to the target
peptide, version 3.06 of FASTA was used (Pearson & Lipman, 1988; Lipman & Pearson, 1985).
FASTA is a program that searches for sequence similarity locally and allows for the introduction
of gaps during sequence alignments (Pearson, 1996). It ranks the sequence neighbors by
expected (E) score, which is a statistical representation of how likely it is to obtain a particular
sequence similarity score while searching a sequence against a particular population. A local
version of this program was run, in which the sequence of each of the 10 protein targets was
searched against a local copy of the nonredundant (NR) protein sequence database of NCBI
(found at the BLAST ftp site ftp://ncbi.nlm.nih.gov/blast). The database contains all of the
protein sequences, identification codes, and header information from GenBank CDS translations,
the PDB, SWISS-PROT, and PIR. For each of the 10 FASTA searches, the following parameters
were used in searching the 309,341 sequences of the NR database: ktup = 2, gap-pen = 12/-2,
BLOSUM 62 scoring matrix from the NCBI BLAST ftp site. To correlate the sequence
comparison results with the structural neighbors, it was necessary to search for the PDB id codes
of each of the structural neighbors in the NR database file, and subsequently use the gi’s
assigned to the PDB entries to probe the FASTA search results.
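The lookup described above amounts to a two-step join: PDB id code to gi, then gi to FASTA E score. The sketch below assumes both tables have already been parsed into dictionaries; the parsing itself, the function name, and all identifiers and values shown are hypothetical.

```python
def e_scores_for_neighbors(neighbor_pdb_ids, pdb_to_gi, fasta_e_by_gi):
    """Map each structural neighbor to its FASTA E score, or None if FASTA did not report it."""
    result = {}
    for pdb_id in neighbor_pdb_ids:
        gi = pdb_to_gi.get(pdb_id)
        # A neighbor missing from either table is treated as undetected by sequence.
        result[pdb_id] = fasta_e_by_gi.get(gi) if gi is not None else None
    return result


# Placeholder identifiers and E scores, for illustration only.
toy_pdb_to_gi = {"pdbA": 111, "pdbB": 222, "pdbC": 333}
toy_fasta_hits = {111: 1e-30, 333: 42.0}
toy_e_scores = e_scores_for_neighbors(
    ["pdbA", "pdbB", "pdbC", "pdbD"], toy_pdb_to_gi, toy_fasta_hits
)
```

Returning None for unreported neighbors keeps "not detected" distinct from a genuinely large E score, so the later handling of those cases stays explicit.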
Elimination of Redundancies
Because the PDB is highly redundant in the structures that it houses (Holm & Sander,
1998), structural neighbors that were the same molecule had to be represented by just
one entry in each of the DALI searches. The NR database, which groups protein entries of
identical amino acid sequence from different databases and within the same database, was used
to discern which of the structural neighbors were 100% redundant with each other in each of the
DALI searches. In choosing which PDB entry should represent its set of redundant structural
neighbors, two criteria were first implemented:
a) For a set of redundant PDB entries, we chose to represent the set with the entry of the lowest z
score. This method ensures that selection for structural neighbors, when necessary, is done in a
conservative manner.
b) Because the target molecule was included in the list of structural neighbors, it sometimes
appeared in sets of redundant molecules. When this occurred, the target molecule was chosen to
represent the set of redundant molecules.
Using these criteria resulted in a 45% reduction of total structural neighbors, most prominent in
the structural neighbors of trypsin (1sgt) and dihydrofolate reductase (1dhf-a) because these
proteins are highly investigated due to their roles in therapeutic approaches to disease. However,
the preservation of accurate structural representation was a concern because there was some
variation in structural similarity within sets of redundant entries. For the 49 redundant sets
among the structural neighbors, the range (the highest z score in a redundant set minus the
lowest z score in that same set) had a median of 1.1, a mean of 2.71, and a standard deviation
of 4.43. Structures within the redundant sets therefore did not differ greatly, except for a few
extremes. In these extremes, one structure was very different from the rest of the structures
within the set due to different ligands associated with the molecule, different conformations for
different subunits of the protein, etc. To allow for accurate structural representation of each
redundant set, these extreme entries were excluded from the redundant sets and included in the
pool of nonredundant structural neighbors. This resulted in a total of 396 nonredundant
structural neighbors, a 44% reduction from the total structural neighbors. The range values for
the redundant sets then had a median of 0.8, a mean of 1.6, and a standard deviation of 1.60.
In addition, because very few redundancies were found among the structural neighbors that
showed poor sequence similarity (E score > 10) to their targets, these PDB entries were manually
inspected for evidence of being the same protein from the same organism. No redundancies were
found this way.
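The two selection criteria and the range statistic can be sketched as follows. This is a minimal illustration, assuming each redundant set is a list of (PDB id, z score) pairs; the identifiers and scores are hypothetical, and the handling of extreme entries described above is not modeled.

```python
from statistics import mean, median


def pick_representative(redundant_set, target_id):
    """Criterion (b): prefer the target if present; otherwise criterion (a): lowest z score."""
    ids = [pdb_id for pdb_id, _ in redundant_set]
    if target_id in ids:
        return target_id
    return min(redundant_set, key=lambda entry: entry[1])[0]


def z_range(redundant_set):
    """Highest z score in the set minus the lowest z score in the same set."""
    zs = [z for _, z in redundant_set]
    return max(zs) - min(zs)


# Placeholder sets: one without the target, one containing it.
toy_sets = [
    [("xxxA", 7.5), ("xxxB", 3.5), ("xxxC", 5.0)],
    [("tgt0", 30.0), ("yyyA", 28.0)],
]
representatives = [pick_representative(s, "tgt0") for s in toy_sets]
ranges = [z_range(s) for s in toy_sets]
range_summary = (median(ranges), mean(ranges))
```

Choosing the lowest z score understates rather than overstates the structural similarity of each set, which matches the conservative intent stated in criterion (a).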
Functional Investigation
Biochemical and cellular function information was obtained for each of the
nonredundant, PDB chains that showed poor or no sequence similarity (E score > 10) with its
target by searching primarily the functional activity areas of the SWISS-PROT (Bairoch &
Apweiler, 1998; Apweiler et al., 1997) protein entry for the chain’s encompassing protein.
“SWISS PROT is a curated protein sequence database which strives to provide a high level of
annotation (such as the description of the function of a protein …)” (from
http://expasy.hcuge.ch/sprot). When SWISS-PROT did not provide enough functional info,
additional information was retrieved from literature and the header of the PDB entry.
Results and Discussion
Targets
The simulated experiment was conducted as though the structures of 10 domains, which
can be found in the following PDB entries and are designated by SCOP, were solved. The
following contains the PDB id, entry name, and source organism.
(a) 2aak (formerly 1aak), ubiquitin conjugation enzyme from Arabidopsis thaliana (Cook et al.,
1992).
(b) 1cks (chain A), cyclin-dependent kinase subunit type 2 (CksHs 2) from Homo sapiens (Parge
et al., 1993).
(c) 3sdh (chain A), hemoglobin I (homodimer, unliganded and carbon monoxide liganded states)
from Scapharca inaequivalvis (Royer, 1994; Royer et al., 1990, 1989).
(d) 1vih, vigilin, repeat 6 from Homo sapiens (Musco et al., 1996; Castiglione Morelli et al.,
1995).
(e) 1dhf (chain A), dihydrofolate reductase (DHFR) complex with folate from Homo sapiens
(Davies et al., 1990; Prendergast et al., 1988).
(f) 1hoe, alpha-amylase inhibitor Hoe-467A from Streptomyces tendae 4158 (Pflugrath et al.,
1986).
(g) 2wrp (chain R), trp repressor (orthorhombic form) from Escherichia coli (Lawson et al.,
1988; Zhang et al., 1987; Schevitz et al., 1985; Joachimiak et al., 1983, 1983; Gunsalus &
Yanofsky, 1980).
(h) 1sgt, trypsin from Streptomyces griseus, strain K1 (Read & James, 1988).
(i) 1fps, farnesyl diphosphate synthase from Gallus gallus (Tarshis et al., 1994).
(j) 1gnd, guanine nucleotide dissociation inhibitor (alpha isoform) from Bos taurus (Schalk et
al., 1996, 1994).
Structure vs. Sequence
The number of structurally similar, nonredundant PDB entries that would not be found by
a sequence similarity search using FASTA with the parameters indicated is almost 3 times greater
than the number that would be found (see Figures 1 & 2). Eight of
the 10 searches resulted in nonredundant structures that were similar to the target domain but
retained very poor sequence similarity (E > 10). In fact, 284 of the 287 PDB entries that make
up the “Not Probable” category had an E score over 100, which is well into the region populated
by non-homologous peptides. Among the random sample of domains chosen were ones from
trypsin (1sgt) and dihydrofolate reductase (1dhf-a), both the subject of many structural studies
due to their biological and potential therapeutic importance. Many of the hits for these 2 targets
are orthologues (same protein, different organism). The proportion of hits with poor sequence
similarity, nearly three quarters of the total, would likely be higher still if the hits of 1sgt
and 1dhf-a were derived from less studied structures.
Structural similarity seems to be conserved as sequence similarity diminishes (see Figure
3). While it is true that many of the more significant structural matches, including the targets
matched upon themselves, are found in the highly probable region of the graph, almost three
quarters of the structurally significant hits are clustered in the poor sequence similarity region
(NP) of the graph. 275 of the 287 hits in the NP region are crowded in the area with log (E) = 2
and 2 ≤ z ≤ 10.
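The sequence similarity categories and the plotted log (E) values can be sketched as below, using the cutoffs given in the Figure 1 legend and the clamping described in the Figure 3 legend. The inclusive boundaries (E ≤ 0.02, E ≤ 10) and the direction of the lower clamp are assumptions, since the inequality symbols and exponents are damaged in the source text; the function names and toy E scores are hypothetical.

```python
import math
from collections import Counter

HP_MAX = 0.02  # highly probable: E <= 0.02 (boundary assumed inclusive)
TZ_MAX = 10.0  # twilight zone: 0.02 < E <= 10; not probable: E > 10


def categorize(e_score):
    """Return 'HP', 'TZ', or 'NP'; None means FASTA did not report the hit."""
    if e_score is None or e_score > TZ_MAX:
        return "NP"
    if e_score > HP_MAX:
        return "TZ"
    return "HP"


def log_e_for_plot(e_score, floor=1e-25, ceiling=100.0):
    """Clamp E to [floor, ceiling] (undetected hits get the ceiling) and take log10."""
    if e_score is None:
        e_score = ceiling
    return math.log10(min(max(e_score, floor), ceiling))


# Placeholder E scores, one of them undetected.
toy_e_scores = [1e-40, 0.01, 0.5, 10.0, 57.0, 250.0, None]
category_counts = Counter(categorize(e) for e in toy_e_scores)
```

Because undetected hits are assigned E = 100, they all land at log (E) = 2, which is why so many NP points share that value in the plot.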
Functional Analysis
Structural hits of poor sequence similarity can provide a wealth of biochemical and cellular
information about the target structure’s function(s). Even without knowledge of the hits that
would probably be detected by traditional sequence similarity search methods, including those in
the “twilight zone”, a list of candidate functions can be developed to gain insight into the
biochemical and cellular function(s) of the target. This was possible in 8 of the 10
cases (see Table 1).
1sgt
Trypsin, a well-studied serine protease, yielded only 8 structural hits of poor sequence
similarity. The probable function for this target domain was easy to spot because most of its
structural hits had the same function. Also, the origin of this enzyme as a precursor (it is
derived from trypsinogen) is suggested by half of the hits being zymogens. This is a clear
example of detection of homologues beyond the grasp of sequence similarity detection but within
that of structural similarity detection.
1dhf-a
Dihydrofolate reductase (DHFR), another well-studied enzyme, catalyzes the two-step
reduction of folate to dihydrofolate and ultimately tetrahydrofolate with the assistance of
NADPH (Garret & Grisham, 1995). DHFR’s binding to specific portions of NADPH is
suggested by the candidate functions. That 85% of its hits bind purine-based nucleotides, with
emphasis on adenine-containing nucleotides, suggests that the target might bind a
similar molecule. DHFR does bind adenosine, a component of NADPH (Zheng et al., 1993;
Basran et al., 1997). In addition, this candidate function agrees with the finding that DHFR’s
interaction with the pyrophosphate moiety of NADPH provides most of the energetically
favorable binding interactions in the NADPH-DHFR complex (Bystroff et al., 1990). 50% of the
hits have phosphorylation or kinase functions, which suggests that the target may interact with
phosphate group(s). In addition to its pyrophosphate-binding function, DHFR has a specific
basic residue that binds to the charged oxygens of the ribose 2’-phosphate group of NADPH and
plays an important role in binding the coenzyme (Gargaro et al., 1996; Bystroff and Kraut,
1991). Studies have also shown that this phosphate binding site is why DHFR interacts
much more strongly with NADPH than with NADH, and this may explain why NADPH is the
preferred coenzyme for DHFR (Huang et al., 1990; Rancourt & Walker, 1990). Thus, in this
case the 2 major candidate functions could lead to speculation of the correct coenzyme.
3sdh
Hemoglobin is a protein involved in the transportation of oxygen. Its oxygen-binding
capacity is revealed by almost a third of its hits being globins, which show poor
sequence similarity to the target domain yet still possess the scaffolding needed to accommodate
the heme group, the necessary component for oxygen transportation. Light-harvesting
proteins (“LHP”s) may have turned up as a candidate function, albeit a very minor one, because
of the ligands that they bind. One of the three binds chlorophyll a, whose ring system is similar
in shape to the porphyrin ring system of heme and whose magnesium ion has the same
charge as the iron in heme. Perhaps a homology is revealed which shows conservation
for binding to these ligands.
1vih
Vigilin consists of 14 repeats of the KH module, a domain known for its RNA-binding
properties (Musco et al., 1996). It was recently confirmed that vigilin binds to vitellogenin
mRNA (Dodson & Shapiro, 1998). Upon inspection of the candidate functions, the major
candidate function of nucleic acid-binding makes it reasonable to guess that this target
domain binds a similar molecule.
1hoe
Alpha-amylase inhibitor prevents alpha-amylase from catalyzing the hydrolysis of
glycosidic linkages in starch and similar polysaccharides by occupying the receiving region of
the enzyme, imitating substrate interactions with the enzyme at specific residues, and causing
structural changes in the active site (Bompard-Giles et al., 1996).
2wrp-r
Tryptophan (trp) repressor binds to specific DNA sites and is involved in gene regulation.
While the specific sites that this molecule binds are not suggested by the major candidate
function, its DNA-binding nature is. Perhaps, upon examination of the binding features of the
hits that led to the development of the transcription regulation candidate function, it can be
determined what the characteristics are of the DNA to which it would bind.
1fps
Farnesyl diphosphate synthase (FPS) catalyzes a hydrocarbon chain elongation reaction using
isopentenyl diphosphate and dimethylallyl diphosphate (Cunillera et al., 1997; Tarshis et al.,
1996). Hydrocarbon-binding is suggested by the major candidate function, and magnesium-binding
is suggested by the next most frequent candidate function.
1gnd
Guanine nucleotide dissociation inhibitor (alpha isoform, GDI) binds to Rab proteins and
prevents GDP from dissociating from these GTPase proteins (Wu et al., 1996; Schalk et al.,
1996). It is accepted that GDI shares high structural similarity with various flavoproteins (Wu
et al., 1996), yet it does not show binding activity with flavin-based molecules in the expected
regions suggested by motifs and structure (Schalk et al., 1996).
Conclusion
We found that 63% of the structural hits of statistically insignificant sequence similarity
had reported functions that contributed to the candidate functions that proved helpful in
determining the biochemical/cellular functions of the target domains. We also found that, upon
examination of the candidate functions, the ones that appeared with great frequency (30-40% or
greater) within each of the individual searches proved to be helpful. While it is not likely that
all of the useful hit structures are homologues of their target domains, the robustness of the
ability of structure to predict function is consistent with a recent study showing that structural
comparison had a superfamily (same fold, similar function) homology detection rate twice as
good as sequence comparison when both were applied to the same population of homologous
proteins and leveled for the same error rate (Levitt & Gerstein, 1998).
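The frequency rule mentioned above can be sketched as a keyword tally over the poor sequence similarity hits of one target. The 30% threshold is taken from the lower end of the range quoted in the text; the function name is hypothetical, and the assignment of keywords to individual hits is invented to be consistent with the 1sgt totals in Table 1, not taken from the underlying data.

```python
from collections import Counter


def candidate_functions(hit_keywords, min_fraction=0.30):
    """Return (keyword, count, fraction) for keywords reported by enough hits.

    Each hit is counted once per keyword, so a hit with two functions
    contributes to both tallies.
    """
    n_hits = len(hit_keywords)
    if n_hits == 0:
        return []
    counts = Counter(kw for keywords in hit_keywords for kw in set(keywords))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(kw, c, c / n_hits) for kw, c in ranked if c / n_hits >= min_fraction]


# Eight hypothetical hits: 8 hydrolases/proteases, 7 serine, 1 thiol, 4 zymogens.
toy_hits = (
    [["hydrolase", "protease", "serine protease", "zymogen"]] * 4
    + [["hydrolase", "protease", "serine protease"]] * 3
    + [["hydrolase", "protease", "thiol protease"]]
)
major_functions = candidate_functions(toy_hits)
major_names = [name for name, _, _ in major_functions]
```

On this toy input the thiol protease keyword (1 of 8 hits) falls below the threshold, leaving hydrolase, protease, serine protease, and zymogen as the major candidates.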
Our work ties in very well with the recent emphasis placed by structural biologists on
solving the structures of orphan gene products for the discovery of new protein folds (Pennisi,
1998). The theory that non-radical degeneracy in amino acids is tolerated by evolution when
creating similar structures and thus functions (Holm & Sander, 1996), with which our work
agrees, predicts that a proportion of the structures of the orphan gene products can be
successfully compared with existing structures that have been functionally characterized. The
functional insight gained by this approach can be used to narrow down to a small fraction the
biological assays necessary to measure the function(s) of these proteins. In addition, with the increase in
the novelty, quantity, and diversity of solved protein structures, this "structure-based approach to
functional genomics" will grow in strength and value.
Acknowledgement
We thank M. Mehnert for designing PERL scripts that were used to check much of the compiled
data.
References
Abola, E.E., Bernstein, F.C, Bryant, S.H., Koetzle, T.F., and Weng, J. (1987) in Protein
Data Bank, in Crystallographic Databases - Information Content, Software Systems, Scientific
Applications (Allen, F.H., Bergerhoff, G., and Sievers, R., eds.) pp. 107-132, Data Commission of
the International Union of Crystallography, Bonn/Cambridge/Chester.
Apweiler, R., Gateau, A., Contrino, S., Martin, M.J., Junker, V., O'Donovan, C., Lang, F.,
Mitaritonna, N., Kappus, S., Bairoch, A. (1997) in ISMB-97 Proceedings 5th
International Conference on Intelligent Systems for Molecular Biology pp 33-43, AAAI
Press, Menlo Park.
Bairoch, A. and Apweiler, R. (1998) NAR 26, 38-42.
Basran, J., Casarotto, M.G., Basran, A., and Roberts, G.C. (1997) Protein Engineering 10, 815-826.
Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer Jr., E.F., Brice, M.D., Rodgers, J.R.,
Kennard, O., Shimanouchi, T., and Tasumi, M. (1977) J. Mol. Biol. 112, 535-542.
Bompard-Giles, C., Rousseau, P., Rouge, P., and Payan, F. (1996) Structure 4, 1441-1452.
Botstein, D., Chervitz, S.A., and Cherry, J.M. (1997) Science 277, 1259-1260.
Brenner, S., Chothia, C., Hubbard, T.J.P., and Murzin, A.G. (1996) Meth. Enzymol. 266, 635-642.
Bystroff, C., Oatley, S.J., and Kraut, J. (1990) Biochemistry 29, 3263-3277.
Bystroff, C. and Kraut, J. (1991) Biochemistry 30, 2227-2239.
Casari, G., De Daruvar, A., Sander, C., and Schneider, R. (1996) Trends Genet. 12, 244-245.
Castiglione Morelli, M.A., Stier, G., Gibson, T.J., Joseph, C., Musco, G., Pastore, A., and Trave,
G. (1995) FEBS Lett. 358, 193.
Chothia, C. (1992) Nature 357, 543-544.
Cook, W.J., Jeffrey, L.C., Sullivan, M.L., and Vierstra, R.D. (1992) J. Biol. Chem. 267, 15116.
Cunillera, N., Boronat, A., Ferrer, A. (1997) J. Biol. Chem. 272, 15381-15388.
Davies, J.F.2nd, Delcamp, T.J., Prendergast, N.J., Ashford, V.A., Freisheim, J.H., and Kraut, J.
(1990) Biochemistry 29, 9467-9479.
Dodson, R.E. and Shapiro, D.J. (1998) Mol. Cell. Biol. 18, 3991-4003.
Gargaro, A.R., Frenkiel, T.A., Nieto, P.M., Birdsall, B., Polshakov, V.I., Morgan, W.D., and
Feeney, J. (1996) European Journal of Biochemistry 238, 435-439.
Garret, R.H. and Grisham,C.M. (1995) Biochemistry p 498 (Saunders College Publishing and
Harcourt Brace College Publishers, Orlando, NJ).
Gunsalus, R.P. and Yanofsky, C. (1980) Proc. Natl. Acad. Sci. USA 77, 1980.
Holm, L. and Sander, C. (1993) J. Mol. Biol. 233, 123-138.
Holm, L. and Sander, C. (1994) NAR 22, 3600-3609.
Holm, L. and Sander, C. (1996) Science 273, 595-602.
Holm, L. and Sander, C. (1997) NAR 25, 231-234.
Holm, L. and Sander, C. (1998) Bioinformatics 14, 423-429.
Huang, S., Appleman, R., Tan, X.H., Thompson, P.D., Blakley, R.L, Sheridan, R.P.,
Venkataraghavan, R., and Freisheim, J.H. (1990) Biochemistry 29, 8063-8069.
Hubbard, T.J.P., Murzin, A.G., Brenner, S.E., and Chothia, C. (1997) NAR 25, 236-239.
Joachimiak, R.A., Schevitz, R.W., Kelley, R.L., Yanofsky, C., and Sigler, P.B. (1983) J. Biol.
Chem. 258, 12641.
Joachimiak, A., Kelley, R.L., Gunsalus, R.P., Yanofsky, C., and Sigler, P.B. (1983) Proc. Natl.
Acad. Sci. USA 80, 683.
Lawson, C.L., Zhang, R.-G., Schevitz, R.W., Otwinowski, Z., Joachimiak, A., and Sigler, P.B.
(1988) Proteins, Structure, Function, Genetics 3, 18.
Levitt, M. and Gerstein, M. (1998) Proc. Natl. Acad. Sci. USA 95, 5913-5920.
Lima, C.D., Klein, M.G., and Hendrickson, W.A. (1997) Science 278, 286-290.
Lipman, D.J., and Pearson, W.R. (1985) Science 227, 1435.
Murzin, A., Brenner, S.E., Hubbard, T., and Chothia, C. (1995) J. Mol. Biol. 247, 536-540.
Musco, G., Stier, G., Joseph, C., Castiglione Morelli, M.A., Nilges, M., Gibson, T.J., and
Pastore, A. (1996) Cell 85, 237-245.
Parge, H.E., Arvai, A.S., Murtari, D.J., Reed, S.I., and Tainer, J.A. (1993) Science 262, 387.
Pearson, W.R. and Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444.
Pearson, W.R. (1996) Meth. in Enzymol. 266, 227-258.
Pennisi, E. (1998) Science 279, 978-979.
Pflugrath, J.W., Wiegand, G., Huber, R., and Vertesy, L. (1986) J. Molec. Biol. 189, 383-386.
Prendergast, N.J., Delcamp, T.J., Smith, P.L., and Freisheim, J.H. (1988) Biochemistry 27,
3664-3671.
Rancourt, S.L. and Walker, V.K. (1990) Biochem. Cell. Biol. 68, 1075-1082.
Read, R.J. and James, M.N. (1988) J. Molec. Biol. 200, 523-551.
Royer, W.E.J., Hendrickson, W.A., and Chiancone, E. (1989) J. Biol. Chem. 264, 21052-21061.
Royer, W.E.J., Hendrickson, W.A., and Chiancone, E. (1990) Science 249, 518-521.
Royer, W.E.J. (1994) J. Molec. Biol. 235, 657-681.
Schalk, I.J., Stura, E.A., Matteson, J., Wilson, I.A., and Balch, W.E. (1994) J. Molec. Biol. 244,
469.
Schalk, I., Zeng, K., Wu, S.-K., Stura, E.A., Matteson, J., Huang, M., Tandon, A., Wilson, I.A.,
and Balch, W.E. (1996) Nature 381, 42-48.
Schevitz, R.W., Otwinowski, Z., Joachimiak, A., Lawson, C.L., and Sigler, P.B. (1985) Nature
317, 782.
Tarshis, L.C., Yan, M., Poulter, C.D., and Sacchettini, J.C. (1994) Biochemistry 33, 10871-10877.
Tarshis, L.C., Proteau, P.J., Kellogg, B.A., Sacchettini, J.C., and Poulter, C.D. (1996) Proc. Natl.
Acad. Sci. USA 93, 15018-15023.
Wu, S.-K., Zeng, K., Wilson, I.A., and Balch, W.E. (1996) Trends in Biochemical Sciences 472-476.
Zhang, R.-G., Joachimiak, A., Lawson, C.L., Schevitz, R.W., Otwinowski, Z., and Sigler, P.B.
(1987) Nature 327, 891.
Zheng, J., Chen, Y.Q., Callender, R. (1993) European Journal of Biochemistry 215, 9-16.
Table 1: Summary of the frequency of occurrence of candidate functions consistently appearing
as keywords for poor sequence similarity structural hits of target domains (a)

1sgt - hydrolase, serine protease, zymogen
  - 8/8 hydrolases
  - 8/8 proteases
  - 7/8 serine proteases
  - 1/8 thiol protease
  - 4/8 zymogens

1dhf-a - oxidoreductase, reductase, dehydrogenase, NADPH-binding
  - 12/14 purine-based nucleotide-binding
  - 2/14 NADP+ or NADPH binding
  - 2/14 GTP-binding
  - 7/14 ATP-binding
  - 7/14 phosphorylation or kinases
  - 3/14 oxidoreductases

3sdh - oxygen transport
  - 6/19 oxygen transport or storage
  - 3/19 acetylation or methylation
  - 3/19 light-harvesting protein
  - 3/19 DNA-binding

1vih - ribonucleoprotein, RNA-binding
  - 14/29 nucleic acid-binding
  - 9/29 DNA-binding
  - 7/29 RNA-binding
  - 7/29 transferases
  - nucleotidyltransferases, phosphotransferases, acetyltransferase, glycosyltransferase
  - 5/29 transcription regulation or stimulation
  - 3/29 acetylation and/or methylation

1hoe - glycosidase inhibitor, alpha amylase inhibitor, disulfide linkages, mimicry of polysaccharides
  - 26/33 carbohydrate-binding or carbohydrate-bound proteins
  - 11/33 glycoproteins
  - 6/33 cell adhesion proteins
  - 6/33 immunoglobulins
  - 3/33 glycosidases
  - 3/33 anti-glycosidases, glycosyltransferase, acetyltransferase
  - 8/33 Immunoglobulin fold proteins
  - 3/33 calcium-binding

2wrp-R - DNA-binding, transcription regulation, repressor, protects DNA from endonuclease activity
  - 27/37 DNA-binding
  - 17/37 transcription regulation
  - 5/37 repressor
  - 7/37 activator
  - 11/37 nuclear proteins

1fps - prenyltransferase, isoprene biosynthesis, isoprenoid-binding, isoprenoid elongation, magnesium-binding
  - 21/67 carbohydrate-binding or carbohydrate-bound proteins
  - 11/67 glycoproteins
  - 7/67 carbohydrate transferase
  - methyltransferase, glycosyltransferase, sugar transport, pentosyltransferase, methylation
  - 20/67 metal-binding
  - 8/67 heme-binding
  - 7/67 iron-binding
  - bind to magnesium, calcium, cobalt, zinc, vanadium
  - 15/67 nucleic acid-binding
  - 11/67 DNA-binding
  - 9/67 nucleotide-binding

1gnd - GTPase activation inhibitor, Rab protein-binding, guanine nucleotide dissociation inhibitor
  - 54/80 bind to nucleotides, nucleosides, or nucleotide composed molecules
  - 49/80 bind to purine-based molecules
  - 45/80 bind to adenine-based molecules
  - 34/80 bind to NAD-based molecules
  - 11/80 bind to FAD
  - 38/80 oxidoreductases
  - 13/80 metal-binding
  - 6/80 zinc-binding
  - 3/80 bind to pyrimidine-based molecules
  - thymine-binding, uracil-binding
  - 3/80 nucleic acid-binding
(a) PDB id codes are included for each of the target domains, and their functions follow the
hyphen. It is important to note that this represents a summary of how often a certain function
was reported for a PDB entry. If a protein has two candidate functions, it is counted once for
each of the candidate functions among the hits of a target domain. 2aak and 1cks-a were not
included due to lack of poor sequence similarity hits.
Figure Legends
Figure 1: Sequence similarity categorization for nonredundant structural similarity hits. The
columns show results for searches done with each of the target domains, which are represented
by their PDB id codes (including their chain designations if applicable). The sequence similarity
values were determined by searching against the NCBI NR database with FASTA. The E scores
obtained from FASTA were used to indicate likelihood of finding the hit through sequence
similarity search as follows: highly probable, E ≤ 0.02; twilight zone, 0.02 < E ≤ 10; not
probable, E > 10. Cutoffs for categories were determined from literature survey (Pearson,
1996), and possible homology detection via sequence similarity search is given the benefit of the
doubt in the fringe area between “Not Probable” and “Twilight Zone” (personal communication
with Bruccoleri, R.E.). Dot density corresponds to probability in determining possible homology
via sequence similarity.
Figure 2: All nonredundant structural hits of 10 target domains categorized by sequence
similarity. The whole pie represents all of the 398 nonredundant structural hits obtained from the
10 DALI searches.
Figure 3: z score vs. log (E score) for all nonredundant structural hits. HP: highly probable
detection of similarity by sequence, TZ: in the “twilight zone” of detection by sequence
similarity, NP: not probable detection of similarity by sequence. Each of the symbols represents
structural hits obtained from comparison of the PDB entry indicated in the legend box. The
numbers next to the PDB id codes in the legend box are, in order, the number of points for each
search, the z score received for the target perfectly matched against itself, and the length of the
target. The PDB entries that were not detected during FASTA searches (E score > 100) were
assigned an E score of 100. PDB entries that received an E score < 10^-25 were assigned an E
score of 10^-25.
[Figure 2: pie chart. Not Probable 72%, Highly Probable 23%, Twilight Zone 5%.]